Question

What distinguishes fake news from real news?

Growing up, we were all told, “Don’t believe everything you read on the Internet.” In the past, this simply meant avoiding obvious scams and hoaxes, like emails from Nigerian princes promising you a share of their wealth. However, the proliferation of digital media has enabled a far more insidious practice: the writing and mass distribution of so-called “news” articles that seem true, and may even incorporate some truthful elements, in order to push false or misleading claims.

Investigations into fake news are plentiful and have been extensively covered in the media. For example, Politico recently reported that “millions of users” have been exposed to fake news about coronavirus on Facebook, with 40% of claims officially debunked by fact checkers still spreading on the platform. More relevant to our timeframe, there were questions about the extent to which Russian propaganda was distributed on Facebook as fake news during the 2016 election.

As Cathy O’Neil warned us in Weapons of Math Destruction, “about two-thirds of American adults have a profile on Facebook. They spend thirty-nine minutes a day on the site, only four minutes less than they dedicate to face-to-face socializing. Nearly half of them, according to a Pew Research Center report, count on Facebook to deliver at least some of their news…” It goes without saying that Facebook is not exactly a hub of reputable reporting, so the fact that this many people get their news from such an unreliable platform is deeply troubling for anyone who cares about having a well-informed populace. Thus, we wanted to explore, in a data-driven way, what characterizes fake news and how we might be able to identify it.

Exploratory Data Analysis

First, we load in the data.

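library(tidyverse)  # read_csv, dplyr verbs, stringr, purrr, tidyr, ggplot2
library(lubridate)  # mdy, floor_date
library(tidytext)   # unnest_tokens, stop_words, get_sentiments
library(broom)      # tidy

real_news = read_csv("True.csv")
fake_news = read_csv("Fake.csv")
data = bind_rows(real_news %>% mutate(truth = "Real"),
                 fake_news %>% mutate(truth = "Fake")) %>% 
  mutate(formatted_date = mdy(date))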

The first thing we do with the dataset is clean it so that we can perform text analysis. This means running it through a “tokenizer,” which breaks up the raw text data into individual words (and other components/“tokens”, but mostly just words). Next, we filter out numbers and “stop words.” These are words that are syntactically necessary but don’t actually contribute to a sentence’s meaning, like “the”, “a”, “and”, etc.

cleaned = data %>% 
  unnest_tokens(word, text, token="words") %>%
  anti_join(stop_words, by = "word") %>% # drop stop words
  filter(!str_detect(word, "[.:0-9]")) # omits tokens containing any of the characters in the brackets
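To make the cleaning step concrete, here is a minimal sketch of the same pipeline run on a single made-up sentence. The toy tibble and its contents are hypothetical and not part of the dataset; the point is only to show which kinds of tokens survive.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row input; the real data has one row per article
toy = tibble(id = 1, text = "The Senate passed 3 bills in 2017.")

toy_tokens = toy %>%
  unnest_tokens(word, text, token = "words") %>% # one lowercase word per row, punctuation stripped
  anti_join(stop_words, by = "word") %>%         # drop stop words such as "the" and "in"
  filter(!str_detect(word, "[.:0-9]"))           # drop tokens containing digits, "." or ":"

print(toy_tokens)
```

Stop words and the numeric tokens should be gone, leaving content words such as "senate".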

Using our cleaned data, we can generate a dataframe that holds the frequencies of each word grouped by real vs. fake news. Note that this value is represented by n, as opposed to total which holds the total number of words contained in fake/real news articles in the dataset.

# frequency function from Twitter lab
frequency <- cleaned %>% 
  group_by(truth) %>% 
  count(word, sort = TRUE) %>% 
  left_join(cleaned %>% 
              group_by(truth) %>% 
              summarise(total = n()))

head(frequency, 5)
## # A tibble: 5 x 4
## # Groups:   truth [2]
##   truth word          n   total
##   <chr> <chr>     <int>   <int>
## 1 Fake  trump     75423 4261498
## 2 Real  trump     42970 4015938
## 3 Real  reuters   28955 4015938
## 4 Real  president 27271 4015938
## 5 Fake  president 26707 4261498
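Because the two groups have different totals, raw counts are easier to compare once they are divided by total. The sketch below does this for three of the rows printed above; the share column is a name introduced here for illustration and is not part of the original frequency dataframe.

```r
library(dplyr)

# Rows copied from the head(frequency, 5) output above
toy_frequency = tribble(
  ~truth, ~word,       ~n,    ~total,
  "Fake", "trump",     75423, 4261498,
  "Real", "trump",     42970, 4015938,
  "Real", "reuters",   28955, 4015938
)

# share: fraction of all cleaned words in that group (hypothetical column name)
toy_share = toy_frequency %>%
  mutate(share = n / total)

print(toy_share)
```

On these numbers, "trump" makes up roughly 1.8% of the cleaned words in fake articles and roughly 1.1% in real ones.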

Time Distribution

To understand the timeframe of our dataset, we graphed a simple histogram of the dates that the articles were published.
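The plotting code is not shown in this write-up. A minimal sketch of how such a histogram could be drawn with ggplot2 follows; it uses a handful of made-up dates (toy_dates is hypothetical) rather than the real data, so it illustrates the technique and does not reproduce our figure. The bin width and styling are assumptions.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the formatted_date and truth columns of the data
toy_dates = tibble(
  formatted_date = as.Date(c("2016-01-15", "2016-03-02", "2016-11-08",
                             "2017-10-20", "2017-11-05", "2017-12-01")),
  truth = c("Fake", "Fake", "Fake", "Real", "Real", "Real")
)

# Overlaid histograms of publication dates, one colour per label
date_hist = ggplot(toy_dates, aes(x = formatted_date, fill = truth)) +
  geom_histogram(binwidth = 30, position = "identity", alpha = 0.6) + # binwidth is in days
  labs(x = "Publication date", y = "Number of articles", fill = "Label")

print(date_hist)
```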

As you can see, we’re working with articles spanning from early 2015 to the end of 2018. Looking at general trends, it seems like fake news rapidly jumped in 2016 and has been gradually declining ever since. The most likely explanation for this is the 2016 presidential election, since presidential election years typically have increased levels of political awareness and activity. With more people getting interested in politics, a market appears for flashy clickbait articles that don’t always stand up to scrutiny. We suspect that the candidates who ran in that particular election, Hillary Clinton and Donald Trump, who were both quite controversial, fueled this spike to some extent.

By comparison, real news seems to have been fairly consistent throughout 2016 and most of 2017, abruptly increasing by a huge margin at the end of 2017 and then declining shortly after peaking. With the obvious exception of late 2017, this consistency also makes sense. Fake news is often written by smaller organizations whose popularity (and thus ability to create more content) waxes and wanes with presidential election years, whereas “real” news tends to come from mainstream media outlets, which are much better established and have consistent readership. Since there’s always some news that needs to be reported, professional journalists should be able to output more or less the same amount of content regardless of the era they’re in.

We’re not certain why the number of articles jumped so drastically in late 2017 and then declined almost immediately after, but it could simply be sampling bias: the creator of this dataset may have just found it more convenient to pull more recent articles. It could even be that the increase in fake news articles in early 2016 was simply the result of increased awareness as the presidential campaign season progressed, and not an actual dramatic increase in fake news. Since we don’t know how the dataset was made, we can’t really know the answer to these questions.

Word Frequencies

News is news, so we expected both real and fake news to cover similar topics, but we wondered if there was a significant difference in what real and fake news chose to focus on, which might reveal any potential bias.

Our heuristic for this was simply looking at the top five words from both the fake news and real news categories and examining both their individual frequencies and the distribution among the top five. Since more news coverage would obviously result in more of a particular word showing up in articles, we decided that word frequency was a good measurement of news interest.

top_words = frequency %>% 
  arrange(desc(n)) %>%
  group_by(truth) %>% slice(1:5) # slice top 5 from both real and fake groups

# split by label rather than by row position, so the result does not
# depend on the order in which the groups happen to be returned
top_fake = top_words %>% filter(truth == "Fake")
top_real = top_words %>% filter(truth == "Real")

Now that we have our top five words for each category, let’s put them on a bar graph.
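The graphing code is omitted here. One way such a bar graph could be drawn is sketched below with ggplot2. The counts are copied from the outputs printed elsewhere in this write-up, but treating exactly these five words as the fake news top five is an assumption made for illustration, and toy_top is a hypothetical name.

```r
library(dplyr)
library(ggplot2)

# Fake news counts taken from the printed outputs; the selection of these
# five words as "the" top five is an assumption for this sketch
toy_top = tibble(
  word = c("trump", "president", "people", "clinton", "obama"),
  n    = c(75423, 26707, 26000, 18289, 18047)
)

# Horizontal bars, ordered from most to least frequent
top_plot = ggplot(toy_top, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Occurrences in fake news articles")

print(top_plot)
```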

As you can see, the words “Trump” and “President” are both frequently included in both fake and real news. This is, of course, not particularly surprising. Any US president would naturally be at or near the top of a word count of a dataset containing primarily American news, but this doesn’t explain why Trump’s name appears so much more than any other word in fake news in particular, when the gap is significantly smaller in real news. Our guess is that Trump is simply an especially polarizing figure, so demagogues on both sides of the political aisle have an incentive to constantly invoke his name to stir up either support for or opposition to the president.

“Obama” and “Clinton” also show up prominently, which makes sense given that Obama was the previous president and Hillary ran essentially as his successor. Both are quite controversial in right-wing media (especially with extreme sources who are likely to create fake news), and the election year where both were prominent in mainstream political discourse spans a large portion of the dataset, so both of them were naturally mentioned plenty of times in fake news.

Other than President Trump, real news covers more mundane topics — “house” (presumably just part of “House of Representatives” or “White House”), “government,” and “Reuters” (a prominent international news organization) round out our top 5 words. This also lines up with what we expected because a lot of the news isn’t supposed to be flashy; a significant amount of important reporting is done on the complex mechanics of government, which may not be as interesting to a wider audience.

Another interesting difference we noticed is that fake news tends to focus on individuals (Trump, Clinton, Obama) whereas real news focuses on institutions (“government,” “house,” “reuters”). President Trump shows up heavily in both, but that’s just because he’s the president. This reemphasizes the tendency of fake news to pursue flashy content — i.e. scandalous stories about polarizing individuals — whereas real news focuses on more normal government activity.

Words Over Time

We then wondered how the top word frequencies had changed over time — our first guess was that words like “Hillary” and “Clinton” would have gone down dramatically after she lost the 2016 election, whereas Trump, having ascended to the presidency, would naturally have mentions of his name increase by a great deal.

To answer this question, we have to look at which words have seen their frequencies change the most dramatically during the time period. Our first step is to split the word token data into one-month chunks, where we can see the number of times a particular word was used in a month (count), how many times that word was used in total (word_total), and how many words were used in that particular month (month_total).

# The following time series analysis for frequency was adapted from the Twitter lab.
words_over_time = cleaned %>% 
  mutate(time_floor = floor_date(formatted_date, unit = "1 month")) %>% 
  count(time_floor, truth, word) %>% 
  group_by(truth, time_floor) %>% 
  mutate(month_total = sum(n)) %>% 
  group_by(truth,word) %>% 
  mutate(word_total = sum(n)) %>% 
  ungroup() %>% 
  rename(count = n) %>% 
  filter(word_total > 18000) # only look at words that have been used >18000 times

head(words_over_time, 5) # just an example
## # A tibble: 5 x 6
##   time_floor truth word      count month_total word_total
##   <date>     <chr> <chr>     <int>       <int>      <int>
## 1 2015-03-01 Fake  clinton      20        1585      18289
## 2 2015-03-01 Fake  obama         7        1585      18047
## 3 2015-03-01 Fake  people        2        1585      26000
## 4 2015-03-01 Fake  president     7        1585      26707
## 5 2015-04-01 Fake  clinton     224       66715      18289

The next step is to “nest” our data, which means that we consolidate all our monthly data for each word and store it in a single row so that each word only has one row representing it (per real/fake label). This allows us to generate models that tell us how much each word changed over time. Computing models for each and every word is quite computationally expensive, which is why we filtered out any word that had a word_total of 18,000 uses or fewer.

nested_data = words_over_time %>% 
  nest(data = c(-word, -truth)) # one row per (word, truth) pair; monthly rows go into the data column

nested_models = nested_data %>% 
  mutate(models = map(data, ~ glm(cbind(count,month_total) ~ time_floor, ., family="binomial")))
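To show what one of these per-word models captures, here is a minimal sketch that fits a binomial regression to made-up monthly counts for a single word. The numbers and the toy_word name are hypothetical. R's glm reads a two-column binomial response as (successes, failures), so this sketch uses month_total - count as the second column; that is a choice made for this example, and it changes the fit only slightly when a word is a small share of the month's total.

```r
# Hypothetical monthly counts for one word whose share of all words is rising
toy_word = data.frame(
  time_floor  = seq(as.Date("2016-01-01"), by = "month", length.out = 6),
  count       = c(10, 20, 30, 45, 60, 80),
  month_total = rep(1000, 6)
)

# Binomial regression of the word's share on time; the response columns
# are (times the word appeared, times any other word appeared)
toy_model = glm(cbind(count, month_total - count) ~ time_floor,
                data = toy_word, family = "binomial")

# Slope on time, in log-odds per day: positive means the word's share is growing
toy_slope = coef(toy_model)[["time_floor"]]
print(toy_slope)
```

A positive slope indicates a word whose share of the month's vocabulary grows over time, and a negative slope indicates one that fades.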

Once we have our models, it becomes a simple matter of unnesting the data, keeping the models whose time trend is statistically significant, and tossing everything else out. Because we are testing many words at once, we adjust the p-values for multiple comparisons first. In our case, every word’s model had an adjusted p-value of less than 0.05, so we can move forward smoothly.

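slopes <- nested_models %>%
  mutate(models = map(models, tidy)) %>% # broom::tidy turns each model into a table of coefficients
  unnest(cols = c(models)) %>%
  filter(term == "time_floor") %>% # keep only the slope on time
  mutate(adjusted.p.value = p.adjust(p.value)) # correct for multiple comparisons (Holm by default)

top_slopes <- slopes %>% 
  filter(adjusted.p.value < 0.05)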
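As a small worked example of the adjustment step, the sketch below runs p.adjust on three made-up p-values. When no method is given, p.adjust uses the Holm correction, which multiplies the smallest p-value by the number of tests, the next smallest by one fewer, and so on, while keeping the adjusted values non-decreasing.

```r
# Hypothetical raw p-values from three separate word models
raw_p = c(0.01, 0.02, 0.03)

# Holm correction (the default method of p.adjust):
# 0.01 * 3 = 0.03, 0.02 * 2 = 0.04, 0.03 * 1 = 0.03, then raised to 0.04
# so that the adjusted values never decrease
adj_p = p.adjust(raw_p)
print(adj_p) # 0.03 0.04 0.04
```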

After all that number crunching, the only thing we have left to do is to graph our real and fake news data and compare them.

Real News Words Over Time

As with overall mentions, Trump tended to dominate the real news cycle for most of our time period, which is only natural. His share of all real news coverage steadily increased throughout campaign season until it peaked around late 2016, which is when the election was held. As expected, once the election was over, the public at large (and thus the media as well) shifted their attention to different stories, so his share started declining.

The most interesting development is the massive drop in mentions of President Trump at the end of 2017. We suspect this might be a result of our dataset’s sampling — if you look back at our histogram of publication dates, our dataset has a much higher number of articles from late 2017. Assuming that there are only so many stories about President Trump that can be written in a given timespan, filling up the dataset with many extra articles would naturally deflate his calculated share of total media coverage.

Fake News Words Over Time

Here, the most interesting trends are clearly Trump and Clinton, the two 2016 presidential contenders. We can observe that while stories about Clinton dominated fake news in 2015 and periodically went up and down as the election proceeded, they dropped off dramatically after her loss (as she was no longer in the public eye). On the other hand, Trump started at what looks to be nearly 0% of fake news coverage in early 2015 (presumably because he wasn’t widely known as a candidate early on), but has since consistently had the most mentions in fake news. We assume that any incumbent president would have a lot of fake news stories written about them, but once again, we also suspect that Trump himself is especially prone to having fake news written about him.

Sentiment Analysis Methods

As interesting as word frequencies are, they can’t actually give us that much information as to whether or not a given article is real or fake. For example, both real and fake news articles frequently discuss President Trump, so knowing that an article frequently mentions him doesn’t help us distinguish the legitimacy of the article at all. That’s where sentiment analysis comes in, where we figure out the “sentiment” (feeling) conveyed by each word and tally up the most common sentiments expressed by both real and fake news. We can measure sentiment in two different ways: AFINN, where sentiment is just indexed on a scale of -5 to 5 (negative to positive emotion), and NRC, which associates words with common emotions like fear, anger, and sadness.

We hypothesized that fake news, with its trend towards clickbait-style content, might have more extreme sentiments, but let’s see how that bears out in the data.

# sent_df: dataframe with sentiment values
# sentiment: sentiment label, "value" for afinn and "sentiment" for nrc
# This function takes in a dataframe with sentiment values and outputs the
# frequency of each factor for both real and fake news articles. We use this
# function to help generate our graphs later (graph code not shown to avoid clutter)
get_freq = function(sent_df, sentiment){
  # this is a quirk with R function parameters
  # need to create a new column called sentiment set equal to the column of the string passed into the variable sentiment
  sent_df$sentiment = sent_df[[sentiment]] 
  counts = sent_df %>% # Get the counts of each factor grouped by truth
    group_by(truth) %>%
    count(sentiment, sort = TRUE)
  totals = sent_df %>% # get total occurrences for real and fake
    group_by(truth) %>%
    summarise(total=n())
  frequency = counts %>% # combine dataframes and calculate frequency
    left_join(totals) %>%
    mutate(freq = n/total)
  return(frequency)
}

sentiment_afinn = cleaned %>% 
  inner_join(get_sentiments("afinn"))
sentiment_abs = sentiment_afinn %>% # named sentiment_abs so it does not shadow base R's abs()
  mutate(value = abs(value))
sentiment_nrc = cleaned %>% 
  inner_join(get_sentiments("nrc"))
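To illustrate what the lexicon join and the frequency calculation do, here is a minimal sketch on made-up tokens. It uses a tiny hand-written lexicon in the same (word, value) shape as AFINN instead of the real one, so the scores below are assumptions and may not match the actual AFINN values; toy_lexicon, toy_tokens, and toy_freq are hypothetical names.

```r
library(dplyr)

# Hypothetical mini-lexicon shaped like AFINN (word, value)
toy_lexicon = tibble(
  word  = c("great", "bad", "disaster"),
  value = c(3, -3, -2)
)

# Hypothetical cleaned tokens, labelled by article type
toy_tokens = tibble(
  truth = c("Fake", "Fake", "Fake", "Real", "Real"),
  word  = c("disaster", "bad", "senate", "great", "bad")
)

# Keep only tokens that appear in the lexicon ("senate" is dropped)
toy_sentiment = toy_tokens %>%
  inner_join(toy_lexicon, by = "word")

# Share of each score within real and within fake tokens
toy_freq = toy_sentiment %>%
  count(truth, value) %>%
  group_by(truth) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()

print(toy_freq)
```

Dividing by the per-group total is what lets us compare real and fake news even though the two groups contain different numbers of scored words.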

Article Sentiment

Here we analyze the sentiment of the text of each article.

AFINN

It appears that real news articles have a higher frequency of moderate sentiment values, whereas fake news articles tend to feature stronger sentiments in either direction. It’s also worth noting that fake articles lean more towards negatively charged words, whereas real news articles seem to be a little more balanced.

AFINN Absolute Value

If we take the absolute values of the AFINN scores, we can isolate the overall strength of sentiment in our text. We can see even more clearly that real news articles have a higher frequency of moderate AFINN values (1-2), whereas fake news articles tend to use stronger wording (AFINN of 3+).

NRC

Our NRC sentiment analysis gives us a little more specificity while also confirming our findings from the AFINN analysis. Fake news articles have higher frequencies of negatively charged words, especially those of anger, disgust, and general negative sentiment. At the same time, they have lower occurrences of positivity and trust when compared to real news articles.

Title Sentiment

We then perform the same sentiment analysis on just the article titles:

titles = data %>%
  unnest_tokens(word, title, token="words") %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "[.:0-9]"))

title_afinn = titles %>%
  inner_join(get_sentiments("afinn"))
title_abs = title_afinn %>%
  mutate(value = abs(value))
title_nrc = titles %>%
  inner_join(get_sentiments("nrc"))

As you can see from the graphs below, we get very similar results. However, the differences seem to be more exaggerated for our titles than for the actual content of our articles. This trend is consistent with our hypothesis, as authors of fake news articles tend to use flashy and clickbaity titles in hopes of luring in potential readers.

AFINN

AFINN Absolute Value

NRC

Comparison

To further investigate the difference in sentiment, or perhaps exaggeration, of our titles, we compare them directly to the sentiment of the articles themselves.

This is a helper function designed to combine a title sentiment dataset and an article sentiment dataset. The key metric here is the difference in frequency of each sentiment value between real and fake articles. A positive value indicates that there was a higher occurrence of the sentiment value in real articles, whereas a negative value indicates a higher occurrence in fake articles. We want to compare the differences of the real and fake articles between the titles of the articles and the articles themselves.

# title: sentiment data on titles
# article: sentiment data on articles
# sentiment: sentiment label, "value" for afinn and "sentiment" for nrc
# This function takes in two dataframes with sentiment values and outputs a new dataframe.
# Each row in this dataframe represents the difference in sentiment value frequencies between
# real and fake news articles for either the title or article body.
title_vs_article = function(title, article, sentiment){
  #get the frequencies dataframes for both title and article
  title = get_freq(title, sentiment) %>% mutate(text="title") # label where the text is from
  article = get_freq(article, sentiment) %>% mutate(text="article")
  combined = rbind(title,article)
  combined = combined %>%
    group_by(text) %>%
    select(truth, sentiment, freq) %>%
    spread(truth, freq) %>% # separate real and fake values into separate columns
    arrange(Real, Fake) %>%
    mutate(Real = replace_na(Real,0), # replace na values with 0
           Fake = replace_na(Fake,0),
           diff = Real-Fake) # take the difference in frequency between real and fake occurrences
  return(combined)
}
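The core of this helper is the reshape-and-subtract step, which can be sketched on made-up frequencies as follows. The numbers are hypothetical, and the sketch uses tidyr's pivot_wider, the newer equivalent of spread, which is a substitution made here and not what the function above calls.

```r
library(dplyr)
library(tidyr)

# Hypothetical frequencies in the shape produced by get_freq(), plus a text label
toy_combined = tribble(
  ~text,     ~truth, ~sentiment, ~freq,
  "title",   "Real", "anger",    0.05,
  "title",   "Fake", "anger",    0.12,
  "article", "Real", "anger",    0.06,
  "article", "Fake", "anger",    0.09,
  "title",   "Real", "trust",    0.20,
  "article", "Real", "trust",    0.18,
  "article", "Fake", "trust",    0.14
)

# One row per (text, sentiment), with Real and Fake as columns;
# a missing combination is filled with 0 before taking the difference
toy_diff = toy_combined %>%
  pivot_wider(names_from = truth, values_from = freq, values_fill = 0) %>%
  mutate(diff = Real - Fake)

print(toy_diff)
```

In this toy data, anger in titles has a diff of about -0.07 (more common in fake titles), while trust in titles has a diff of 0.20 because the missing fake value is treated as 0.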

AFINN

As seen in the earlier graphs, fake news articles are more charged, particularly with negative sentiment, but we can also observe the difference in magnitude between the titles and the articles themselves. In every category except for neutral positive words, the magnitude of the difference between real and fake news is greater for titles than articles. This essentially means that the level of bias is even greater in fake news titles when compared to the level of bias in fake news articles.

AFINN Absolute Value

Similar to the absolute value graphs above, this graph further emphasizes the difference in level of bias. Whether positive or negative, fake news titles appear to be even more exaggerated than the articles themselves.

NRC

We again see that fake news articles are skewed more towards negative emotions, with their titles being even more negative. On the other hand, real news articles and their titles follow a similar pattern of titles being stronger than their articles, but they are more likely to trend towards the direction of sentiments like positivity and trust.

Conclusions

By and large, most of our hypotheses were confirmed by the evidence. Fake news tended to focus on prominent political figures like the president, while real news tended to cover a broader range of government activity. Our sentiment analysis confirmed our prediction that fake news articles would be more strongly charged than real news, a difference that was even more stark when comparing titles. After all, the point of titles is to draw in readers, so having a flashier title makes sense even for real news.

The radical sentiments in fake news are what make it the most troubling. Disseminating fake news that is completely neutral doesn’t do much on its own, but reading especially polarized fake news ends up polarizing people. Since real news tends to be more neutral, and people tend to prefer news that lines up with their preexisting opinions, this leads to a vicious cycle of people reading progressively more extreme news with an increasingly large detachment from the truth. Anyone should be able to see why this is immensely troubling, which is why the simplest advice we can offer is that next time you read a news article, ask yourself: “How does this make me feel?” If your answer is a strong emotion, then it’s more likely that the article is questionable.

Everyone is biased in some way or another, and this will inevitably bleed into anyone’s news coverage, so it’s also a good idea to get news from a variety of different sources. The line between “real” and “fake” is not always as clear as our dataset might imply, so the best way to avoid being misled is a balanced media diet from reputable sources.

Future Work

One major issue we ran into while working is that we were completely uncertain as to how this dataset was created, and thus unsure that it was reliably sampled. With a subject as polarized as what constitutes “fake” news, it’s important to have a trustworthy dataset to base your analysis on. However, there were some notable issues with our particular dataset that made us doubt its reliability. For example, the incredible spike of real news datapoints in late 2017 and the dramatic drop in mentions of “Trump,” who as the President should always be quite commonly featured in mainstream news articles, were both phenomena that we couldn’t think of an explanation for, making us believe they were the product of unreliable sampling by the creator of the dataset. The unusually high occurrence of “Reuters” is somewhat understandable given that many other reporters source their coverage from them, but there are other organizations like the Associated Press that perform similar functions. Given the other unexplainable characteristics of our dataset, the fact that only Reuters was in the top words is also somewhat questionable.

For further study, we could build a classifier to actually estimate the likelihood that any given news article is real or fake. We could also do the text analysis with n-grams to look at not just individual words, but groups of words. Data collection could be improved as well: instead of relying on an online dataset that was created in an unclear way, we could make our own, which would help us not only clear up any uncertainty but also add extra labels that would help develop more advanced models. There could have been a more in-depth definition of what a “fake” article is, since it could mean misleading, deliberately deceptive, or just completely factually wrong. Our dataset didn’t distinguish between those, simply throwing all the articles into separate “real” and “fake” CSV files that we had to manually combine. It would have also been nice to have each article’s source as a label; it was possible to manually find out the source just by copy/pasting the title verbatim into Google, but it would have been too tedious for us to do that for 40,000+ articles.
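As a rough idea of what the n-gram extension might look like, the sketch below counts two-word phrases (bigrams) in two made-up sentences using tidytext. We did not run this on the dataset; the toy_articles tibble and its text are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical article snippets, one per label
toy_articles = tibble(
  truth = c("Real", "Fake"),
  text  = c("The White House said on Tuesday",
            "You will not believe what the White House did")
)

# Split each text into overlapping two-word phrases and count them per label
toy_bigrams = toy_articles %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(truth, bigram, sort = TRUE)

print(toy_bigrams)
```

With bigrams, a phrase like "white house" is counted as one unit, which would resolve the ambiguity we ran into with the single word "house".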

Sources

Background Image: https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fbernardmarr%2Ffiles%2F2018%2F05%2FAdobeStock_187220917-1200x796.jpg

Data: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset#

Text Mining Lecture Code

Twitter Lab Code

Politico Article: https://www.politico.com/news/2020/04/16/facebook-fake-news-coronavirus-190054

Weapons of Math Destruction