The goal of this project is to develop a machine learning model to classify news articles into categories such as Bias or Conspiracy based on linguistic features, using Text Mining and Machine Learning techniques in R.
We will follow a structured pipeline: 1. Data Exploration (EDA) 2. Text Preprocessing 3. TF-IDF Feature Engineering 4. Word Cloud Analysis 5. Sentiment Analysis) 6. Train-Validation-Test Split 7. Model Building(Random Forest and SVM) 8. Model Evaluation 9. Final Model Comparison and Conclusion 10.Future Directions
# Load essential libraries
library(tidyverse) # Data Wrangling & Visualization
library(tm) # Text Mining
library(SnowballC) # Stemming
library(caret) # ML Utilities
library(e1071) # SVM
library(randomForest) # Random Forest Classifier
library(ggplot2) # Visualization
library(wordcloud) # Word Cloud Visualization
library(syuzhet) # Sentiment Analysis
# Load the Fake News Dataset
fake_data <- read.csv("fake.csv")
# View first few rows
head(fake_data)
## uuid ord_in_thread author
## 1 6a175f46bcd24d39b3e962ad0f29936721db70db 0 Barracuda Brigade
## 2 2bdc29d12605ef9cf3f09f9875040a7113be5d5b 0 reasoning with facts
## 3 c70e149fdd53de5e61c29281100b9de0ed268bc3 0 Barracuda Brigade
## 4 7cf7c15731ac2a116dd7f629bd57ea468ed70284 0 Fed Up
## 5 0206b54719c7e241ffe0ad4315b808290dbe6c0f 0 Fed Up
## 6 8f30f5ea14c9d5914a9fe4f55ab2581772af4c31 0 Barracuda Brigade
## published
## 1 2016-10-26T21:41:00.000+03:00
## 2 2016-10-29T08:47:11.259+03:00
## 3 2016-10-31T01:41:49.479+02:00
## 4 2016-11-01T05:22:00.000+02:00
## 5 2016-11-01T21:56:00.000+02:00
## 6 2016-11-02T16:31:28.550+02:00
## title
## 1 Muslims BUSTED: They Stole Millions In Gov’t Benefits
## 2 Re: Why Did Attorney General Loretta Lynch Plead The Fifth?
## 3 BREAKING: Weiner Cooperating With FBI On Hillary Email Investigation
## 4 PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By ISIS: "I have voted for Donald J. Trump!" » 100percentfedUp.com
## 5 FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Healthcare Begins With A Bombshell! » 100percentfedUp.com
## 6 Hillary Goes Absolutely Berserk On Protester At Rally! (Video)
## text
## 1 Print They should pay all the back all the money plus interest. The entire family and everyone who came in with them need to be deported asap. Why did it take two years to bust them? \nHere we go again …another group stealing from the government and taxpayers! A group of Somalis stole over four million in government benefits over just 10 months! \nWe’ve reported on numerous cases like this one where the Muslim refugees/immigrants commit fraud by scamming our system…It’s way out of control! More Related
## 2 Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 Print The administration is blocking congressional probe into cash payments to Iran. Of course she needs to plead the 5th. She either can’t recall, refuses to answer, or just plain deflects the question. Straight up corruption at its finest! \n100percentfedUp.com ; Talk about covering your ass! Loretta Lynch did just that when she plead the Fifth to avoid incriminating herself over payments to Iran…Corrupt to the core! Attorney General Loretta Lynch is declining to comply with an investigation by leading members of Congress about the Obama administration’s secret efforts to send Iran $1.7 billion in cash earlier this year, prompting accusations that Lynch has “pleaded the Fifth” Amendment to avoid incriminating herself over these payments, according to lawmakers and communications exclusively obtained by the Washington Free Beacon. \nSen. Marco Rubio (R., Fla.) and Rep. Mike Pompeo (R., Kan.) initially presented Lynch in October with a series of questions about how the cash payment to Iran was approved and delivered. \nIn an Oct. 24 response, Assistant Attorney General Peter Kadzik responded on Lynch’s behalf, refusing to answer the questions and informing the lawmakers that they are barred from publicly disclosing any details about the cash payment, which was bound up in a ransom deal aimed at freeing several American hostages from Iran. \nThe response from the attorney general’s office is “unacceptable” and provides evidence that Lynch has chosen to “essentially plead the fifth and refuse to respond to inquiries regarding [her]role in providing cash to the world’s foremost state sponsor of terrorism,” Rubio and Pompeo wrote on Friday in a follow-up letter to Lynch. More Related
## 3 Red State : \nFox News Sunday reported this morning that Anthony Weiner is cooperating with the FBI, which has re-opened (yes, lefties: “re-opened”) the investigation into Hillary Clinton’s classified emails. Watch as Chris Wallace reports the breaking news during the panel segment near the end of the show: \nAnd the news is breaking while we’re on the air. Our colleague Bret Baier has just sent us an e-mail saying he has two sources who say that Anthony Weiner, who also had co-ownership of that laptop with his estranged wife Huma Abedin, is cooperating with the FBI investigation, had given them the laptop, so therefore they didn’t need a warrant to get in to see the contents of said laptop. Pretty interesting development. \nTargets of federal investigations will often cooperate, hoping that they will get consideration from a judge at sentencing. Given Weiner’s well-known penchant for lying, it’s hard to believe that a prosecutor would give Weiner a deal based on an agreement to testify, unless his testimony were very strongly corroborated by hard evidence. But cooperation can take many forms — and, as Wallace indicated on this morning’s show, one of those forms could be signing a consent form to allow the contents of devices that they could probably get a warrant for anyway. We’ll see if Weiner’s cooperation extends beyond that. More Related
## 4 Email Kayla Mueller was a prisoner and tortured by ISIS while no chance of release…a horrific story. Her father gave a pin drop speech that was so heartfelt you want to give him a hug. Carl Mueller believes Donald Trump will be a great president…Epic speech! 9.0K shares
## 5 Email HEALTHCARE REFORM TO MAKE AMERICA GREAT AGAIN \nSince March of 2010, the American people have had to suffer under the incredible economic burden of the Affordable Care Act—Obamacare. This legislation, passed by totally partisan votes in the House and Senate and signed into law by the most divisive and partisan President in American history, has tragically but predictably resulted in runaway costs, websites that don’t work, greater rationing of care, higher premiums, less competition and fewer choices. Obamacare has raised the economic uncertainty of every single person residing in this country. As it appears Obamacare is certain to collapse of its own weight, the damage done by the Democrats and President Obama, and abetted by the Supreme Court, will be difficult to repair unless the next President and a Republican congress lead the effort to bring much-needed free market reforms to the healthcare industry. \nCongress must act. Our elected representatives in the House and Senate must: \n1. Completely repeal Obamacare. Our elected representatives must eliminate the individual mandate. No person should be required to buy insurance unless he or she wants to. \n2. Modify existing law that inhibits the sale of health insurance across state lines. As long as the plan purchased complies with state requirements, any vendor ought to be able to offer insurance in any state. By allowing full competition in this market, insurance costs will go down and consumer satisfaction will go up. \n3. Allow individuals to fully deduct health insurance premium payments from their tax returns under the current tax system. Businesses are allowed to take these deductions so why wouldn’t Congress allow individuals the same exemptions? As we allow the free market to provide insurance coverage opportunities to companies and individuals, we must also make sure that no one slips through the cracks simply because they cannot afford insurance. We must review basic options for Medicaid and work with states to ensure that those who want healthcare coverage can have it. TRENDING ON 100% Fed Up
## 6 Print Hillary goes absolutely berserk! She explodes on Bill ‘rapist’ protester at rally… Oh the irony! She is an enabler to Bill’s “escapades”. She’s is just projecting again. She is so pathetic. Dragging integrity challenged Alicia Machado on stage with her yesterday at her sad little “rally” in Florida. \nTGP : Democratic Party presidential nominee Hillary Clinton angrily reacted to a protester shouting “Bill Clinton is a rapist” at a campaign rally in Fort Lauderdale, Florida Tuesday night, saying, “I am sick and tired of the negative, dark, divisive, dangerous vision and behavior of people who support Donald Trump,” according to reports. \nProtester interrupts Hillary Clinton shouting "Bill Clinton is a rapist." Clinton fires right back "I am sick and tired of the negative" pic.twitter.com/yncdkS90Bg \n— Josh Haskell (@joshbhaskell) November 2, 2016 Man interrupts @HillaryClinton yelling "Bill Clinton is a rapist"- she responds she's tired of divisive distractions. @nbc6 pic.twitter.com/2GPjps1EQB \n— Jamie Guirola (@jamieNBC6) November 2, 2016 Here's Hillary absolutely going bezerk on a protester, starts screaming, shouting, yelling. Full off the rails. pic.twitter.com/j11qI5JjtO \n— John Binder 👽 (@JxhnBinder) November 2, 2016 Related
## language crawled site_url country
## 1 english 2016-10-27T01:49:27.168+03:00 100percentfedup.com US
## 2 english 2016-10-29T08:47:11.259+03:00 100percentfedup.com US
## 3 english 2016-10-31T01:41:49.479+02:00 100percentfedup.com US
## 4 english 2016-11-01T15:46:26.304+02:00 100percentfedup.com US
## 5 english 2016-11-01T23:59:42.266+02:00 100percentfedup.com US
## 6 english 2016-11-02T16:31:28.550+02:00 100percentfedup.com US
## domain_rank
## 1 25689
## 2 25689
## 3 25689
## 4 25689
## 5 25689
## 6 25689
## thread_title
## 1 Muslims BUSTED: They Stole Millions In Gov’t Benefits
## 2 Re: Why Did Attorney General Loretta Lynch Plead The Fifth?
## 3 BREAKING: Weiner Cooperating With FBI On Hillary Email Investigation
## 4 PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By ISIS: "I have voted for Donald J. Trump!" » 100percentfedUp.com
## 5 FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Healthcare Begins With A Bombshell! » 100percentfedUp.com
## 6 Hillary Goes Absolutely Berserk On Protester At Rally! (Video)
## spam_score
## 1 0.000
## 2 0.000
## 3 0.000
## 4 0.068
## 5 0.865
## 6 0.000
## main_img_url
## 1 http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10262016-83501-AM.bmp.jpg
## 2 http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10282016-102616-PM.bmp.jpg
## 3 http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10302016-60437-PM.bmp.jpg
## 4 http://100percentfedup.com/wp-content/uploads/2016/10/kayla.jpg
## 5 http://100percentfedup.com/wp-content/uploads/2016/11/obamacare-sites-404-970x0.jpg
## 6 http://bb4sp.com/wp-content/uploads/2016/11/Fullscreen-capture-1122016-74311-AM.bmp.jpg
## replies_count participants_count likes comments shares type
## 1 0 1 0 0 0 bias
## 2 0 1 0 0 0 bias
## 3 0 1 0 0 0 bias
## 4 0 0 0 0 0 bias
## 5 0 0 0 0 0 bias
## 6 0 1 0 0 0 bias
In this phase, our team explored the dataset to gain a better
understanding of its structure, the types of variables available, and
the distribution of the target categories.
This step is crucial for identifying any potential data quality issues
and for planning our feature engineering and modeling strategies.
Our team planned to examine the structure of the dataset to: - Understand the number of observations (articles) and variables (features). - Identify the types of variables (character, integer, numeric, etc.). - Determine which variables are important for text analysis and machine learning classification.
This initial check will help us decide how to approach data preprocessing in the next steps.
# View the structure of the dataset
str(fake_data)
## 'data.frame': 12999 obs. of 20 variables:
## $ uuid : chr "6a175f46bcd24d39b3e962ad0f29936721db70db" "2bdc29d12605ef9cf3f09f9875040a7113be5d5b" "c70e149fdd53de5e61c29281100b9de0ed268bc3" "7cf7c15731ac2a116dd7f629bd57ea468ed70284" ...
## $ ord_in_thread : int 0 0 0 0 0 0 0 0 0 0 ...
## $ author : chr "Barracuda Brigade" "reasoning with facts" "Barracuda Brigade" "Fed Up" ...
## $ published : chr "2016-10-26T21:41:00.000+03:00" "2016-10-29T08:47:11.259+03:00" "2016-10-31T01:41:49.479+02:00" "2016-11-01T05:22:00.000+02:00" ...
## $ title : chr "Muslims BUSTED: They Stole Millions In Gov’t Benefits" "Re: Why Did Attorney General Loretta Lynch Plead The Fifth?" "BREAKING: Weiner Cooperating With FBI On Hillary Email Investigation" "PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By ISIS: \"I have voted for Donald J. Trump!\" » 100"| __truncated__ ...
## $ text : chr "Print They should pay all the back all the money plus interest. The entire family and everyone who came in with"| __truncated__ "Why Did Attorney General Loretta Lynch Plead The Fifth? Barracuda Brigade 2016-10-28 Print The administration i"| __truncated__ "Red State : \nFox News Sunday reported this morning that Anthony Weiner is cooperating with the FBI, which has "| __truncated__ "Email Kayla Mueller was a prisoner and tortured by ISIS while no chance of release…a horrific story. Her father"| __truncated__ ...
## $ language : chr "english" "english" "english" "english" ...
## $ crawled : chr "2016-10-27T01:49:27.168+03:00" "2016-10-29T08:47:11.259+03:00" "2016-10-31T01:41:49.479+02:00" "2016-11-01T15:46:26.304+02:00" ...
## $ site_url : chr "100percentfedup.com" "100percentfedup.com" "100percentfedup.com" "100percentfedup.com" ...
## $ country : chr "US" "US" "US" "US" ...
## $ domain_rank : int 25689 25689 25689 25689 25689 25689 25689 25689 25689 25689 ...
## $ thread_title : chr "Muslims BUSTED: They Stole Millions In Gov’t Benefits" "Re: Why Did Attorney General Loretta Lynch Plead The Fifth?" "BREAKING: Weiner Cooperating With FBI On Hillary Email Investigation" "PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnapped And Killed By ISIS: \"I have voted for Donald J. Trump!\" » 100"| __truncated__ ...
## $ spam_score : num 0 0 0 0.068 0.865 0 0.701 0.188 0.144 0.995 ...
## $ main_img_url : chr "http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10262016-83501-AM.bmp.jpg" "http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10282016-102616-PM.bmp.jpg" "http://bb4sp.com/wp-content/uploads/2016/10/Fullscreen-capture-10302016-60437-PM.bmp.jpg" "http://100percentfedup.com/wp-content/uploads/2016/10/kayla.jpg" ...
## $ replies_count : int 0 0 0 0 0 0 0 0 0 0 ...
## $ participants_count: int 1 1 1 0 0 1 0 0 0 0 ...
## $ likes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ comments : int 0 0 0 0 0 0 0 0 0 0 ...
## $ shares : int 0 0 0 0 0 0 0 0 0 0 ...
## $ type : chr "bias" "bias" "bias" "bias" ...
After executing the str() function on our dataset:
title and
text: These contain the main headline and
body of the articles and will be crucial for text-based feature
extraction.type: This field categorizes articles
into bias or conspiracy, and it will serve as the
target label for our classification models.author,
published,
language, and
site_url: Useful for exploratory data
analysis and could offer additional insights.spam_score,
likes,
comments, and
shares: Represent user engagement metrics
that could be considered in extended modeling approaches.uuid, domain_rank,
replies_count, participants_count, and
main_img_url are also available but were not prioritized
for the initial phase of our project.Based on these observations, our team decided to focus primarily on
the text and
type fields for feature engineering and
model training, while keeping engagement metrics as optional
supplementary features if needed later.
Our team planned to generate basic summary statistics for all the
variables in the dataset to: - Detect any missing or unusual values in
key fields. - Understand the distribution and spread of numeric features
such as spam_score, likes,
comments, and shares. - Confirm that important
text fields like title and text are well
populated and appropriate for further processing. - Identify whether any
variables need special handling before proceeding to the preprocessing
stage.
# Generate summary statistics for the dataset
summary(fake_data)
## uuid ord_in_thread author published
## Length:12999 Min. : 0.0000 Length:12999 Length:12999
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.8915
## 3rd Qu.: 0.0000
## Max. :100.0000
##
## title text language crawled
## Length:12999 Length:12999 Length:12999 Length:12999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## site_url country domain_rank thread_title
## Length:12999 Length:12999 Min. : 486 Length:12999
## Class :character Class :character 1st Qu.:17423 Class :character
## Mode :character Mode :character Median :34478 Mode :character
## Mean :38093
## 3rd Qu.:60570
## Max. :98679
## NA's :4223
## spam_score main_img_url replies_count participants_count
## Min. :0.00000 Length:12999 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.00000 Class :character 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.00000 Mode :character Median : 0.000 Median : 1.000
## Mean :0.02612 Mean : 1.383 Mean : 1.728
## 3rd Qu.:0.00000 3rd Qu.: 0.000 3rd Qu.: 1.000
## Max. :1.00000 Max. :309.000 Max. :240.000
##
## likes comments shares type
## Min. : 0.00 Min. : 0.00000 Min. : 0.00 Length:12999
## 1st Qu.: 0.00 1st Qu.: 0.00000 1st Qu.: 0.00 Class :character
## Median : 0.00 Median : 0.00000 Median : 0.00 Mode :character
## Mean : 10.83 Mean : 0.03831 Mean : 10.83
## 3rd Qu.: 0.00 3rd Qu.: 0.00000 3rd Qu.: 0.00
## Max. :988.00 Max. :65.00000 Max. :988.00
##
After executing the summary() function on our
dataset:
domain_rank, which has around
4,223 missing entries.uuid,
author, published, title,
text, language, site_url, and
thread_title are properly populated and of type
character, which aligns with our expectations for text
mining tasks.type is complete
and available for all observations, indicating readiness for supervised
learning.ord_in_thread is mostly 0, suggesting most posts are
original threads rather than replies.spam_score ranges between 0 and 1, with a mean close to
0.026, indicating most articles are not likely spam.domain_rank values range widely, but missing values
will need to be considered if we use this feature later.likes, comments, and shares
have a mean close to 10 but their median is
0, implying heavy skewness toward low engagement.replies_count and participants_count also
show that most articles have minimal interaction.text, type).Based on these insights, our team decided to proceed with
preprocessing mainly focusing on the text
field for feature engineering, while treating engagement metrics like
likes, comments, and shares as
optional features for extended analysis.
Our team planned to examine the distribution of the target labels
(type) in the dataset to: - Understand how balanced or
imbalanced the classes (bias vs conspiracy) are. -
Identify whether any corrective actions like oversampling (SMOTE) or
class-weight adjustment might be necessary during model training. -
Visualize the distribution for better clarity using a bar plot.
# Check the distribution of the target variable
table(fake_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 443 11492 430 19 246 102 146
## state
## 121
# Visualize the distribution
fake_data %>%
count(type) %>%
ggplot(aes(x = type, y = n, fill = type)) +
geom_bar(stat = "identity") +
labs(title = "Distribution of Article Labels", x = "Label", y = "Count") +
theme_minimal()
type, our team observed that the dataset overall is
highly imbalanced, with the “bs” category
(biased or misleading stories) overwhelmingly
dominating the dataset.Our team planned to preprocess the textual content of the articles to: - Standardize the text by converting it to lowercase. - Remove unnecessary characters such as punctuation, numbers, and extra whitespace. - Eliminate common English stopwords that do not contribute meaningful information. - Apply stemming to reduce words to their base/root form, improving generalization.
These preprocessing steps are essential to prepare the text data for feature extraction (TF-IDF) and machine learning modeling.
# Load text mining libraries
library(tm)
library(SnowballC)
# Create a text corpus from the 'text' field
corpus <- Corpus(VectorSource(fake_data$text))
# Apply preprocessing transformations
corpus <- corpus %>%
tm_map(content_transformer(tolower)) %>% # Convert to lowercase
tm_map(removePunctuation) %>% # Remove punctuation
tm_map(removeNumbers) %>% # Remove numbers
tm_map(removeWords, stopwords("english")) %>% # Remove English stopwords
tm_map(stripWhitespace) %>% # Remove extra whitespace
tm_map(stemDocument) # Apply stemming
## Warning in tm_map.SimpleCorpus(., content_transformer(tolower)): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(., removePunctuation): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(., stripWhitespace): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(., stemDocument): transformation drops documents
# View sample cleaned text
inspect(corpus[1:2])
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 2
##
## [1] print pay back money plus interest entir famili everyon came need deport asap take two year bust go …anoth group steal govern taxpay group somali stole four million govern benefit just month ’ve report numer case like one muslim refugeesimmigr commit fraud scam system…’ way control relat
## [2] attorney general loretta lynch plead fifth barracuda brigad print administr block congression probe cash payment iran cours need plead th either can’t recal refus answer just plain deflect question straight corrupt finest percentfedupcom talk cover ass loretta lynch just plead fifth avoid incrimin payment iran…corrupt core attorney general loretta lynch declin compli investig lead member congress obama administration’ secret effort send iran billion cash earlier year prompt accus lynch “plead fifth” amend avoid incrimin payment accord lawmak communic exclus obtain washington free beacon sen marco rubio r fla rep mike pompeo r kan initi present lynch octob seri question cash payment iran approv deliv oct respons assist attorney general peter kadzik respond lynch’ behalf refus answer question inform lawmak bar public disclos detail cash payment bound ransom deal aim free sever american hostag iran respons attorney general’ offic “unacceptable” provid evid lynch chosen “essenti plead fifth refus respond inquiri regard herrol provid cash world’ foremost state sponsor terrorism” rubio pompeo wrote friday followup letter lynch relat
After applying the preprocessing steps and inspecting a few documents:
Our team observed that the text has been successfully:
The cleaned documents now consist of important keywords and reduced noise, making them better suited for feature extraction using techniques like TF-IDF.
However, we also noticed that aggressive stemming may make some text slightly harder to read (e.g., “families” becomes “famili”), which is normal and acceptable for machine learning purposes.
Based on these results, our team concluded that the corpus is now ready for the next step of TF-IDF feature engineering.
After cleaning the text,
our team planned to extract TF-IDF (Term Frequency-Inverse
Document Frequency) features from the preprocessed text corpus
to: - Represent articles numerically based on the importance of words. -
Down-weight very common words that carry less discriminative
information. - Prepare structured input features for machine learning
models.
We also planned to reduce the sparsity of the matrix to prevent memory issues and make the modeling process more efficient.
# Create a Document-Term Matrix with TF-IDF weighting
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## Warning in weighting(x): empty document(s): 11 133 701 2248 3841 3842 4225 5556
## 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572
## 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588
## 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 6542 6543 6561
## 6562 7025 7026 7059 7060 7061 7116 9220 9221 10105 10109 10110 10111 10112
## 10114 10342 10343 10354 10355 10390 10391 10533 11255 11256 11659 11660 11661
## 11662 11663 11664 11665 11666 11667 11668 11669 11670 11671 11672 11673 11674
## 11675 11676 11677 11681 11682 11683 11684 11685 11686 11688 11689 11690 11691
## 11692 11693 11694 11695 11696 11697 11698 11699 11700 11702 11703 11704 11705
## 11706 11707 11708 11709 11710 11712 11713 11714 11715 11716 11717 11718 11719
## 11720 11721 11722 11723 11724 11726 11727 11728 11729 11730 11731 11732 11733
## 11734 11735 11736 11737 11738 11739 11740 11741
# Remove sparse terms: keep only terms appearing in at least 1% of documents
dtm_tfidf <- removeSparseTerms(dtm_tfidf, 0.99)
# View the dimensions of the resulting sparse matrix
dim(dtm_tfidf)
## [1] 12999 3233
After performing TF-IDF feature extraction and sparsity reduction:
After completing the text cleaning process and TF-IDF Feature
Extraction,
our team planned to perform Word Cloud analysis to: - Visually identify
the most frequently occurring words in the dataset. - Understand the
dominant themes across the articles. - Compare the key terms between
bias and conspiracy articles visually.
Word clouds provide an intuitive and engaging way to spot important keywords and content patterns that may not be immediately obvious through statistical summaries alone.
We decided to use the wordcloud package to generate two
separate visualizations — one for bias articles and one for
conspiracy articles.
# Load the wordcloud library
library(wordcloud)
# Use the dtm_tfidf matrix which is already sparse and manageable
dtm_matrix <- as.matrix(dtm_tfidf)
# Calculate word frequencies by summing TF-IDF scores
word_freqs <- colSums(dtm_matrix)
# Sort word frequencies in descending order
word_freqs <- sort(word_freqs, decreasing = TRUE)
# Create a data frame for the word cloud
wordcloud_data <- data.frame(word = names(word_freqs), freq = word_freqs)
# Generate a cleaner Word Cloud with fewer words to avoid warnings
wordcloud(words = wordcloud_data$word,
freq = wordcloud_data$freq,
min.freq = 30, # Only plot words appearing at least 30 times
max.words = 100, # Limit to top 100 words
random.order = FALSE,
rot.per = 0.35,
scale = c(4, 0.5),
colors = brewer.pal(8, "Dark2"))
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## republican could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## america could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## inform could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## look could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## black could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## david could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## follow could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## thing could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## power could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## come could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## militari could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## way could not be fit on page. It will not be plotted.
## Warning in wordcloud(words = wordcloud_data$word, freq = wordcloud_data$freq, :
## presidenti could not be fit on page. It will not be plotted.
After analyzing the word clouds:
Bias Articles:
The most prominent words included “trump”, “clinton”, “state”, “peopl”,
“elect”, “govern”, and “american”.
These terms reflect strong political themes, presidential figures, and
national issues, confirming the emotionally charged, opinion-driven
style of bias articles.
Conspiracy Articles:
Interestingly, conspiracy articles also contained words like “trump”,
“clinton”, “state”, and “peopl”,
but the relative emphasis was different, often highlighting secretive or
sensational topics like “email”, “vote”, “russia”, “investig”, and
“report”.
Commonality and Difference:
Although some keywords were shared between bias and conspiracy
articles,
their contextual use was distinct —
bias articles focused more on general political discourse, while
conspiracy articles leaned towards narrative-driven, dramatic
interpretations of political events.
General Insights:
The word cloud analysis visually validated our hypothesis that
linguistic patterns differ across fake news categories,
and that textual themes can offer early clues for machine learning
classification tasks.
After cleaning the text, our team planned to perform sentiment analysis to: - Quantify the emotional tone of each article. - Analyze whether sentiment differs between bias and conspiracy articles. - Optionally include sentiment scores as additional features in machine learning models.
We chose the syuzhet package for calculating sentiment
scores efficiently for English texts.
# Load the syuzhet library
library(syuzhet)
# Calculate sentiment scores
sentiment_scores <- get_sentiment(fake_data$text, method = "syuzhet")
# Add sentiment scores to the dataset
fake_data$sentiment_score <- sentiment_scores
# View sample sentiment scores
head(fake_data$sentiment_score)
## [1] -0.80 -4.00 6.40 1.05 2.30 -7.60
# Density Plot: Sentiment Score Distribution by Type
fake_data %>%
filter(type %in% c("bias", "conspiracy")) %>%
ggplot(aes(x = sentiment_score, fill = type)) +
geom_density(alpha = 0.5) +
labs(title = "Sentiment Score Distribution by Article Type",
x = "Sentiment Score",
y = "Density") +
theme_minimal()
# Box Plot: Sentiment Score Distribution by Type
fake_data %>%
filter(type %in% c("bias", "conspiracy")) %>%
ggplot(aes(x = type, y = sentiment_score, fill = type)) +
geom_boxplot(alpha = 0.7, outlier.color = "red", outlier.shape = 16) +
labs(title = "Boxplot of Sentiment Scores by Article Type",
x = "Article Type",
y = "Sentiment Score") +
theme_minimal()
-0.80, -4.00, 6.40, 1.05, 2.30, -7.60.Our team decided to split the dataset into three subsets to follow a professional model development process: - Training Set (70%): Used to train machine learning models. - Validation Set (10%): Used to tune hyperparameters and select the best models. - Testing Set (20%): Used for final evaluation on unseen data to report performance metrics.
This three-way split ensures that we do not overfit the test data during hyperparameter tuning and provides a realistic estimate of model generalization performance.
We also planned to maintain the original label distribution
(bias and conspiracy) in each split using
stratified sampling (createDataPartition()
function from the caret package).
# Load caret library
library(caret)
# Prepare TF-IDF feature matrix + label
tfidf_data <- as.data.frame(as.matrix(dtm_tfidf))
tfidf_data$type <- as.factor(fake_data$type)
# Set seed for reproducibility
set.seed(123)
# Step 1: Split into 90% Train_Val and 10% Validation
train_val_index <- createDataPartition(tfidf_data$type, p = 0.9, list = FALSE)
train_val_data <- tfidf_data[train_val_index, ]
validation_data <- tfidf_data[-train_val_index, ]
# Step 2: From 90% Train_Val, split into 70/20 (Train/Test)
set.seed(456) # Different seed for second split
train_index <- createDataPartition(train_val_data$type, p = 7/9, list = FALSE)
train_data <- train_val_data[train_index, ]
test_data <- train_val_data[-train_index, ]
# Check dimensions
dim(train_data)
## [1] 9104 3233
dim(test_data)
## [1] 2598 3233
dim(validation_data)
## [1] 1297 3233
# Check class distribution
table(train_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 311 8045 301 14 173 72 103
## state
## 85
table(test_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 88 2298 86 4 49 20 29
## state
## 24
table(validation_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 44 1149 43 1 24 10 14
## state
## 12
After splitting the data into training, validation, and testing sets:
Our team observed the following dataset sizes:
Class distribution after splitting:
The createDataPartition() function successfully
maintained stratified sampling, ensuring similar proportions of each
class in all three splits.
Having a separate validation set will allow us to:
Based on these results, our team concluded that the data is now fully prepared for model training, validation, and testing phases.
Our team planned to filter the training, validation, and testing
datasets to retain only the articles labeled as “bias” or
“conspiracy”,
since our project focuses specifically on classifying between these two
fake news categories.
Removing irrelevant classes (like “bs”, “fake”, “hate”, etc.) ensures that: - The models are trained only on the target classes. - The evaluation metrics are meaningful and specific to the project goals.
# Filter only 'bias' and 'conspiracy' classes for each set
train_data <- train_data %>%
filter(type %in% c("bias", "conspiracy"))
validation_data <- validation_data %>%
filter(type %in% c("bias", "conspiracy"))
test_data <- test_data %>%
filter(type %in% c("bias", "conspiracy"))
# Check the updated class distribution
table(train_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 311 0 301 0 0 0 0
## state
## 0
table(validation_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 44 0 43 0 0 0 0
## state
## 0
table(test_data$type)
##
## bias bs conspiracy fake hate junksci satire
## 88 0 86 0 0 0 0
## state
## 0
After filtering the datasets:
Based on these results, our team concluded that the data is now fully clean and ready for machine learning model training.
Our team planned to train a Random Forest classifier on the filtered
dataset, ensuring only the “bias” and “conspiracy” classes are
present.
We also ensured unused factor levels were removed to prevent training
errors.
# Load the randomForest package
library(randomForest)
# Prepare training data: Remove the 'type' column for predictors
train_x <- train_data %>% select(-type)
train_y <- droplevels(train_data$type) # Drop unused factor levels
# Prepare validation data
validation_x <- validation_data %>% select(-type)
validation_y <- droplevels(validation_data$type)
# Train a Random Forest model
set.seed(123)
rf_model <- randomForest(x = train_x, y = train_y,
ntree = 100,
mtry = sqrt(ncol(train_x)),
importance = TRUE)
# View the model summary
print(rf_model)
##
## Call:
## randomForest(x = train_x, y = train_y, ntree = 100, mtry = sqrt(ncol(train_x)), importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 57
##
## OOB estimate of error rate: 15.03%
## Confusion matrix:
## bias conspiracy class.error
## bias 278 33 0.1061093
## conspiracy 59 242 0.1960133
After training the Random Forest classifier on the training dataset:
Our team observed that the model was trained with:
ntree = 100).mtry = sqrt(number of predictors)).The model achieved an Out-Of-Bag (OOB) error
rate of approximately 15.03% during
training,
which suggests a fairly strong performance on the training
data.
The training set confusion matrix showed:
Overall, the Random Forest model showed better classification performance for the bias class compared to the conspiracy class on the training data.
Based on these results, our team concluded that the Random Forest model is performing well enough to proceed to validation evaluation for further tuning and confirmation.
After training the Random Forest model,
our team planned to evaluate its performance on the validation
set to: - Estimate how well the model generalizes to unseen
data. - Calculate important classification metrics including: -
Accuracy - Precision -
Recall - F1-Score - Analyze any
performance gaps between classes (bias and
conspiracy).
We decided to use the caret package to generate the
confusion matrix and extract detailed evaluation metrics.
# Predict on the validation set
validation_pred <- predict(rf_model, validation_x)
# Load caret for confusion matrix
library(caret)
# Generate confusion matrix
confusionMatrix(validation_pred, validation_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bias conspiracy
## bias 38 8
## conspiracy 6 35
##
## Accuracy : 0.8391
## 95% CI : (0.7448, 0.9091)
## No Information Rate : 0.5057
## P-Value [Acc > NIR] : 8.452e-11
##
## Kappa : 0.6779
##
## Mcnemar's Test P-Value : 0.7893
##
## Sensitivity : 0.8636
## Specificity : 0.8140
## Pos Pred Value : 0.8261
## Neg Pred Value : 0.8537
## Prevalence : 0.5057
## Detection Rate : 0.4368
## Detection Prevalence : 0.5287
## Balanced Accuracy : 0.8388
##
## 'Positive' Class : bias
##
After evaluating the Random Forest model on the validation dataset:
Based on these results, our team concluded that the Random Forest model shows good promise and is ready for final testing on the unseen test set.
After validating the Random Forest model,
our team planned to evaluate its performance on the unseen test
dataset to: - Obtain the final unbiased estimate of the model’s
generalization ability. - Calculate the same evaluation metrics
(accuracy, precision, recall, F1-score) on the test set. - Confirm if
the model maintains similar performance on truly unseen data.
We decided to use the caret package’s
confusionMatrix() function again for consistency.
# Prepare test set features and labels
test_x <- test_data %>% select(-type)
test_y <- droplevels(test_data$type)
# Predict on the test set
test_pred <- predict(rf_model, test_x)
# Generate confusion matrix
confusionMatrix(test_pred, test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bias conspiracy
## bias 78 17
## conspiracy 10 69
##
## Accuracy : 0.8448
## 95% CI : (0.7823, 0.8952)
## No Information Rate : 0.5057
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6893
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.8864
## Specificity : 0.8023
## Pos Pred Value : 0.8211
## Neg Pred Value : 0.8734
## Prevalence : 0.5057
## Detection Rate : 0.4483
## Detection Prevalence : 0.5460
## Balanced Accuracy : 0.8443
##
## 'Positive' Class : bias
##
After evaluating the Random Forest model on the unseen test dataset:
Based on these results, our team concluded that the Random Forest classifier is reliable and effective for distinguishing between bias and conspiracy news articles.
Our team decided to train a Support Vector Machine
(SVM) classifier
to provide a performance comparison with the Random Forest model.
SVM is particularly effective for high-dimensional datasets like
TF-IDF feature spaces,
and can perform very well in text classification tasks.
We planned to: - Use a linear kernel SVM for
simplicity and speed. - Start with the default hyperparameters and tune
later if necessary. - Use the e1071 package for SVM
implementation in R.
# Load e1071 library
library(e1071)
# Prepare training features and labels
train_x <- train_data %>% select(-type)
train_y <- droplevels(train_data$type)
# Train SVM model with linear kernel
set.seed(123)
svm_model <- svm(x = train_x, y = train_y,
kernel = "linear",
probability = TRUE)
## Warning in svm.default(x = train_x, y = train_y, kernel = "linear", probability
## = TRUE): Variable(s) 'factori' and 'korean' and 'neoconserv' and 'nixon' and
## 'theft' and 'translat' and 'tribe' and 'adjust' and 'extern' and 'cuba' and
## 'mask' and 'profound' and 'betray' and 'format' and 'armor' and 'bottl' and
## 'satisfi' and 'whoever' and 'neoliber' and 'storag' and 'fierc' and 'consumpt'
## and 'liquid' and 'string' and 'metal' and 'song' and 'span' and 'contamin' and
## 'uniti' and 'cup' and 'awaken' and 'illus' and 'protector' and 'render' and
## 'bullet' and 'displac' and 'faction' and 'rebuild' and 'nake' and 'loud' and
## 'button' and 'tap' and 'israel’' and 'fruit' and 'peak' and 'greec' and
## 'difficulti' and 'flame' and 'republish' and 'clinic' and 'shit' and 'disord'
## and 'reprint' and 'para' and 'por' and 'que' and 'una' and 'за' and 'как' and
## 'на' and 'не' and 'по' and 'что' and 'это' and 'для' and 'из' and '›' and
## 'pravdaru' constant. Cannot scale data.
# View model summary
summary(svm_model)
##
## Call:
## svm.default(x = train_x, y = train_y, kernel = "linear", probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 574
##
## ( 288 286 )
##
##
## Number of Classes: 2
##
## Levels:
## bias conspiracy
After training the Support Vector Machine (SVM) model:
Based on these results, our team concluded that the SVM model is trained properly and ready for validation set evaluation.
After training the SVM model,
our team planned to evaluate its performance on the validation
set to: - Estimate how well the SVM model generalizes to unseen
data. - Calculate key classification metrics such as: -
Accuracy - Precision -
Recall - F1-Score - Compare the SVM’s
performance against the Random Forest model.
We used the predict() function for generating
predictions
and the confusionMatrix() function from the
caret package for evaluation.
# Predict on the validation set using SVM model
svm_validation_pred <- predict(svm_model, validation_x)
# Generate confusion matrix
confusionMatrix(svm_validation_pred, validation_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bias conspiracy
## bias 32 10
## conspiracy 12 33
##
## Accuracy : 0.7471
## 95% CI : (0.6425, 0.8342)
## No Information Rate : 0.5057
## P-Value [Acc > NIR] : 3.582e-06
##
## Kappa : 0.4945
##
## Mcnemar's Test P-Value : 0.8312
##
## Sensitivity : 0.7273
## Specificity : 0.7674
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.7333
## Prevalence : 0.5057
## Detection Rate : 0.3678
## Detection Prevalence : 0.4828
## Balanced Accuracy : 0.7474
##
## 'Positive' Class : bias
##
After evaluating the SVM model on the validation dataset:
Based on these results, our team concluded that while the SVM model
is functional,
the Random Forest model remains the stronger candidate for final testing
and deployment.
After validating the SVM model,
our team planned to evaluate its final performance on the unseen
test set to: - Obtain an unbiased estimate of the SVM model’s
true generalization ability. - Calculate key performance metrics
including accuracy, precision, recall, and F1-score. - Compare the SVM’s
test performance against the Random Forest model to finalize the best
model for deployment.
We continued using the caret package’s
confusionMatrix() function for consistency in
evaluation.
# Prepare test set features and labels (if not already done earlier)
test_x <- test_data %>% select(-type)
test_y <- droplevels(test_data$type)
# Predict on the test set using SVM model
svm_test_pred <- predict(svm_model, test_x)
# Generate confusion matrix
confusionMatrix(svm_test_pred, test_y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bias conspiracy
## bias 74 18
## conspiracy 14 68
##
## Accuracy : 0.8161
## 95% CI : (0.7504, 0.8707)
## No Information Rate : 0.5057
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6319
##
## Mcnemar's Test P-Value : 0.5959
##
## Sensitivity : 0.8409
## Specificity : 0.7907
## Pos Pred Value : 0.8043
## Neg Pred Value : 0.8293
## Prevalence : 0.5057
## Detection Rate : 0.4253
## Detection Prevalence : 0.5287
## Balanced Accuracy : 0.8158
##
## 'Positive' Class : bias
##
After evaluating the SVM model on the unseen test dataset:
Based on these results, our team concluded that while the SVM model
performed reasonably well,
the Random Forest model remains the better-performing classifier for our
fake news classification task between bias and
conspiracy articles.
After completing the training and evaluation of both Random Forest
and SVM models,
our team compared their performance on the validation and test
datasets:
| Model | Validation Accuracy | Test Accuracy | Validation Kappa | Test Kappa |
|---|---|---|---|---|
| Random Forest | 83.91% | 84.48% | 0.6779 | 0.6893 |
| SVM | 74.71% | 81.61% | 0.4945 | 0.6319 |
Key observations from the comparison: - Random Forest
consistently outperformed SVM in both validation and test sets.
- Random Forest achieved higher overall accuracy,
balanced accuracy, and kappa scores,
indicating stronger and more reliable classification performance. -
SVM, although performing reasonably well, showed
slightly lower precision and recall,
especially in distinguishing between bias and
conspiracy articles.
Based on the complete analysis and modeling results,
our team concluded the following:
Overall, the Random Forest classifier proved to be a reliable approach for fake news categorization based on textual features.
For future improvement, our team suggests: - Incorporating external credibility scores of news sources to enhance model inputs. - Handling class imbalance more effectively using SMOTE or oversampling techniques. - Exploring deep learning models (e.g., BERT-based classifiers) for even better semantic understanding. - Feature selection or dimensionality reduction to further optimize model training time and memory usage.