Fake News!

Our project started with the seemingly innocuous proposal of creating an algorithm to identify real versus fake news, a classification that has had a lot of cachet over the last half decade. Is it possible to create such an algorithm?

Academic definition:

“…‘fake news’ [are] news stories that are false: the story itself is fabricated, with no verifiable facts, sources or quotes.” -University of Michigan Libraries.

Cultural definition:

In its current form, the term is often used to allege political bias on the part of a news source. The problem is exacerbated by news sources either implicitly or explicitly denying any bias, while butting up against the reality that even if what they report objectively happened and is verifiable, what they choose and do not choose to report is itself a form of bias.

In Summation:

“Not only do different people have opposing views about the meaning of “fake news”, in practice the term undermines the intellectual values of democracy – and there is a real possibility that it means nothing. We would be better off if we stopped using it."-Chris Ratcliffe

In light of this, we have chosen the academic approach, and will try to build an algorithm that can distinguish news with verifiable facts from misinformation.

Data

We located a source on Kaggle in which the author compiled a dataset of news articles identified as misinformation through fact-check research, along with a set of articles verified as truthful.

The Onion:

We identified the satire website “The Onion” as a good source of misinformation: it publishes comedy in the form of news articles, so its stories read like news but are, by design, not factual.

“All The News” Dataset:

We found a dataset on Kaggle containing 150,000 news articles from six major online publishers, scraped from 2015 to 2017.

Tidying

We created several document term matrices (dtms), one for each of the data sources from which we collected news articles. We then tidied these dtms so that they fit the format expected by our models.
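
To give a concrete (if simplified) picture of that step, below is a rough sketch of how a dtm can be built from article text with the tm package; the sample articles and the exact preprocessing choices are illustrative assumptions rather than our exact pipeline.

library(tm)

# Illustrative input: a data frame with doc_id and text columns (assumed layout)
news_df <- data.frame(
  doc_id = c("real_1", "fake_1"),
  text = c("The senate passed the spending bill on Tuesday.",
           "Area man heroically passes a bill of his own."),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(news_df))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# One row per document, one column per term
dtm <- DocumentTermMatrix(corpus)
dtm_df <- as.data.frame(as.matrix(dtm))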

Notable functions:

?format_dtm()

Description

Formats a document term matrix into the structure of another document term matrix: removes columns from input_df that are not present in reference_df, then adds columns that are present in reference_df but NOT in input_df, filling them with zeros.

Usage

format_dtm(input_df, reference_df)

Arguments

input_df: the dtm to be modified

reference_df: the dtm whose column structure input_df should be made to match
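
The function body is not reproduced here, but a minimal sketch of the column alignment described above might look like the following; only the argument names come from the documentation, and the implementation details are assumptions.

format_dtm <- function(input_df, reference_df) {
  # Drop terms that the reference dtm does not use
  keep <- intersect(colnames(input_df), colnames(reference_df))
  out <- input_df[, keep, drop = FALSE]

  # Add reference terms missing from the input dtm, filled with zeros
  missing <- setdiff(colnames(reference_df), colnames(out))
  if (length(missing) > 0) out[missing] <- 0

  # Return columns in the reference order so a trained model sees the same layout
  out[, colnames(reference_df), drop = FALSE]
}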


?create_dtm_for_model()

Description

Takes a source news data frame, creates a corpus from it, and prepares a dtm that can be used with a trained model.

Usage

create_dtm_for_model(source_csv, comparison_dtm, output_file)

Arguments

source_csv: source news data frame consisting of doc_id and text columns.

comparison_dtm: csv file containing the dtm used for model training.

output_file: resulting csv with a dtm that can be used to make predictions with the model.
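
As with format_dtm(), a rough sketch of what this wrapper could look like is shown below. Treating both inputs as csv paths, and the specific preprocessing steps, are assumptions; it reuses the format_dtm() sketch above.

create_dtm_for_model <- function(source_csv, comparison_dtm, output_file) {
  library(tm)

  news_df <- read.csv(source_csv, stringsAsFactors = FALSE)          # doc_id, text
  reference_df <- read.csv(comparison_dtm, stringsAsFactors = FALSE) # training dtm

  corpus <- VCorpus(DataframeSource(news_df))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))

  dtm_df <- as.data.frame(as.matrix(DocumentTermMatrix(corpus)))

  # Align the new dtm with the dtm the model was trained on
  dtm_df <- format_dtm(dtm_df, reference_df)

  write.csv(dtm_df, output_file, row.names = FALSE)
  invisible(dtm_df)
}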

Training a model

We used the randomForest() algorithm to classify our data as real or fake news, with a 70%/30% train/test split.

We built one model from the real/fake dataset from the University of Victoria, and a second using our “Onion” scrape as the misinformation class together with the University of Victoria “real” data.
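
A minimal sketch of the split and training call follows; the label column name ("class"), the tree count, and the seed are assumptions rather than our exact settings.

library(randomForest)

# Assumed: dtm_df is a labeled dtm with a factor column "class" ("True"/"Fake")
set.seed(42)
train_idx <- sample(nrow(dtm_df), size = floor(0.7 * nrow(dtm_df)))
train_set <- dtm_df[train_idx, ]
test_set  <- dtm_df[-train_idx, ]

features <- setdiff(names(dtm_df), "class")
rf_model <- randomForest(x = train_set[, features], y = train_set$class, ntree = 100)

# Confusion matrix on the held-out 30%
preds <- predict(rf_model, newdata = test_set[, features])
table(Predicted = preds, Actual = test_set$class)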

Confusion Matrix using the Fake/True model

Confusion Matrix using The Onion dtm

Analysis

?perform_random_forest()

Description

Trains a random forest model on a document term matrix and applies that trained model to a second document term matrix, classifying each document as Information (“True”) or Misinformation (“Fake”). The function returns a prediction data frame.

Usage

perform_random_forest(base_dtm, target_dtm)

Arguments

base_dtm: previously classified csv.

target_dtm: unclassified document term matrix to be classified.
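
The argument names above are the documented ones; the body below is only a sketch of the described behaviour, and the label column name ("class") and row-name doc ids are assumptions. It reuses the format_dtm() sketch from earlier.

perform_random_forest <- function(base_dtm, target_dtm) {
  library(randomForest)

  train_df <- read.csv(base_dtm, stringsAsFactors = FALSE)  # labeled training dtm
  train_df$class <- factor(train_df$class)                  # "True" / "Fake"
  features <- setdiff(names(train_df), "class")

  rf_model <- randomForest(x = train_df[, features], y = train_df$class, ntree = 100)

  # Align the unclassified dtm with the training layout before predicting
  target_aligned <- format_dtm(target_dtm, train_df[, features])
  prediction <- predict(rf_model, newdata = target_aligned)

  # doc ids assumed to be the row names of the target dtm
  data.frame(doc_id = rownames(target_dtm), prediction = prediction,
             stringsAsFactors = FALSE)
}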

Exploratory Usage Of Our Models

The Onion model appears to be working and confirms our prior expectations: we would not expect any news outlet to have a majority of its articles classified as misinformation.

Exploring Data with Visualizations

We also performed a sentiment analysis on the “fake” and “real” news datasets. We are not able to fact-check the articles in this analysis to truly identify which are factually correct and which are truly “fake” news, but we can analyze the use of positive and negative words.
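
That word-level comparison can be done with tidytext; the sketch below uses the Bing lexicon and made-up example rows, so the lexicon choice, example articles, and column names are assumptions rather than our exact analysis.

library(dplyr)
library(tidytext)

# Illustrative input: articles labeled "Fake" or "True" (layout assumed)
articles <- data.frame(
  label = c("Fake", "True"),
  text  = c("This shocking disaster is a terrible, outrageous lie.",
            "Officials confirmed the successful launch on schedule."),
  stringsAsFactors = FALSE
)

# Count positive and negative words per label using the Bing lexicon
sentiment_counts <- articles %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(label, sentiment)

sentiment_counts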

For more visualizations, take a look at this RPubs page.

Conclusion

We successfully created models from the data we segmented into training and test sets. When we applied those models to data outside the scope of our training data, we were left with a number of questions.

  • When can you draw conclusions about unclassified data?
  • We noticed that some of the articles classified as fake consisted of emotional or poetic language; for example, a music review was classified as misinformation. To what extent does emotional language throw off our models?

When we tried to improve our model by removing variables that were not related to the content of the articles, the accuracy on the test sets remained high, but performance on out-of-scope data went haywire!

Next Steps:

  1. How can we train our models to deal with articles outside of scope?
  2. When can we apply our models outside of their training scope?