Project Libraries
library(stringr)
library(tidyverse)
library(dplyr)
library(tm)
library(stringi)
library(SnowballC)
library(textstem)
library(sentimentr)
library(text2vec)
library(caret)
library(rvest)
library(tidytext)
library(randomForest)
library(topicmodels)
library(wordcloud)
library(wordcloud2)
library(grid)
library(gridExtra)
library(NLP)
set.seed(500)
Project Summary
Introduction
Much of today’s online news is consumed through social media channels, where it can be difficult to judge bias while skimming articles in a feed. Because most platforms use recommendation algorithms that create feedback loops influencing which content a user sees, a person’s newsfeed can quickly shift to match whatever “bias” they have searched for, liked, or commented on in the past.
This can be particularly dangerous when it comes to news about public health. Taking the COVID pandemic as an example, online communities have become extremely polarized around subjects such as Vaccine Safety, Mask Effectiveness, and Virus Transmissibility. Depending on what type of news a person consumes, their perspective on the pandemic could be radically different from another individual’s.
This project aims to analyze the underlying differences in discourse about COVID across political leanings (“Left”, “Right”, and “Center”), and to develop a multi-class classification model that identifies potential spin in new articles seen online.
Data Collection
This project uses online news article data from 10 sources. Articles were scraped and enriched in Python using the NewsAPI and Newspaper3k packages (see github).
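For illustration only, here is a minimal R sketch of the kind of request used to pull article metadata; the actual pipeline was written in Python, and the query values and "YOUR_API_KEY" placeholder below are assumptions, not the project’s real parameters.
library(httr)
library(jsonlite)
# query the NewsAPI "everything" endpoint for COVID articles from a few sources
resp <- GET("https://newsapi.org/v2/everything",
            query = list(q = "covid",
                         sources = "bbc-news,cnn,fox-news",
                         pageSize = 100,
                         apiKey = "YOUR_API_KEY"))
# parse the JSON response into a data frame of articles
newsapi_articles <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$articles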
The sources were tagged with guidance from allsides.com, a website that uses crowdsourcing to rate the political leanings of news sources:
sources_mapping <- c(
"bbc-news" = "Center",
"associated-press" = "Center",
"the-american-conservative" = "Right",
"national-review" = "Right",
"breitbart-news" = "Right",
"fox-news" = "Right",
"cnn" = "Left",
"the-washington-post" = "Left",
"vice-news" = "Left",
"buzzfeed" = "Left"
)
Conclusion
My analysis indicates that news sources portray information about COVID-19 differently depending on their assigned political leaning. Additionally, I have shown that machine learning models that consider a document’s text, text structure, sentiment, and emotion can be used to classify political leaning.
In this study, multiple models were tested, including Regularized Logistic Regression, Random Forest, Bagged Logistic Regression, and Naive Bayes. Most models improved on the baseline accuracy of 34%, and the top models achieved accuracy above 80%. This suggests that machine learning can identify the differing patterns of text and topic across news sources with different political biases.
Project Code
Load Data
article_data <- read_csv("https://raw.githubusercontent.com/man-of-moose/masters_607/main/projects/final_project/enriched_data.csv", show_col_types = FALSE)
article_data %>%
ggplot() +
geom_bar(aes(x=target_response, fill=target_response))
Data Cleaning and Feature Engineering
Initial Data Cleaning
Some of the articles were loaded with incorrectly formatted text. Here we remove any characters outside the printable ASCII range, which strips the mis-encoded text.
article_data <- article_data %>%
select(-source_id) %>%
mutate_if(
is.character, ~gsub('[^ -~]', '', .))
For our classification, we will consider all text present in the document title, description, and content (sourced from NewsAPI). We will also consider the scraped “full text” (sourced from Newspaper3k).
article_data <- article_data %>%
mutate(
text_predictor = str_c(title," ",description," ",content," ",full_text)
)
Feature Engineering
Document Structure Features
Understanding the structure of a document may be critical to classifying its potential bias. Structural features include things like the length of the article, the number of unique words, and so on.
I believe that the total count of unique words in an article may be indicative of the political leanings of the article. We will create a function here that counts the total number of unique words from a document.
unique_words_count <- function(string){
string = str_replace_all(string, pattern = "[[:punct:]]", replacement= " ")
string = str_replace_all(string, "\\s+", " ")
string = str_trim(string)
string = unlist(str_split(string, pattern= " "))
unique_count = length(unique(string))
return(unique_count)
}
I believe that the total count of unique lemmas in an article may be indicative of the political leanings of the article. We will create a function here that counts the total number of unique lemmas in a document.
unique_lemmas_count <- function(string){
string = str_replace_all(string, pattern = "[[:punct:]]", replacement= " ")
string = str_replace_all(string, "\\s+", " ")
string = str_trim(string)
lemmas = lemmatize_strings(string, dictionary = lexicon::hash_lemmas)
lemmas = unlist(str_split(lemmas, pattern= " "))
unique_count = length(unique(lemmas))
return(unique_count)
}
In addition to the number of unique lemmas, I think the total number of lemmas will also be interesting to review, especially when combined with the unique count to get the ratio of unique to total lemmas.
total_lemmas_count <- function(string){
string = str_replace_all(string, pattern = "[[:punct:]]", replacement= " ")
string = str_replace_all(string, "\\s+", " ")
string = str_trim(string)
lemmas = lemmatize_strings(string, dictionary = lexicon::hash_lemmas)
lemmas = unlist(str_split(lemmas, pattern= " "))
total_count = length(lemmas)
return(total_count)
}
Here we apply the above functions to add a few additional features to our model.
article_data <- article_data %>%
mutate(
predictor_length = str_length(str_replace_all(text_predictor, " ", "")),
total_words = str_count(text_predictor, '\\w+'),
avg_word_size = predictor_length / total_words,
unique_words = unlist(lapply(text_predictor, unique_words_count)),
unique_word_ratio = unique_words / total_words,
total_lemmas = unlist(lapply(text_predictor, total_lemmas_count)),
unique_lemmas = unlist(lapply(text_predictor, unique_lemmas_count)),
unique_lemma_ratio = unique_lemmas / total_lemmas
)
Sentiment Features
Using the sentimentr package, we can easily apply sentiment analysis, producing both sentiment scores and emotion classifications. We will test whether these features add anything significant to our model.
sentiment_data <- sentiment_by(article_data$text_predictor) %>%
select(ave_sentiment)
article_data <- cbind(article_data, sentiment_data)
emotion_data <- emotion_by(article_data$text_predictor) %>%
select(element_id, emotion_type, ave_emotion) %>%
group_by(element_id) %>%
pivot_wider(names_from = "emotion_type", values_from = "ave_emotion") %>%
ungroup()
colnames(emotion_data) <- str_c(colnames(emotion_data), "sent")
article_data <- cbind(article_data, emotion_data)
EDA
Evaluating some of the Structural Features for importance
Evaluate if predictor_length is relevant in the model
article_data %>%
ggplot() +
geom_boxplot(aes(x=predictor_length, fill=target_response))
anova <- aov(predictor_length ~ target_response, data = article_data)
summary(anova)
##                  Df    Sum Sq  Mean Sq F value Pr(>F)
## target_response   2 1.773e+07  8863469   0.847  0.429
## Residuals       415 4.340e+09 10458527
Based on the results of the ANOVA test, it seems that the predictor_length may not play a major role in the classification of documents (p-value > 0.05)
article_data %>%
ggplot() +
geom_boxplot(aes(x=unique_word_ratio, fill=target_response))
anova <- aov(unique_word_ratio ~ target_response, data = article_data)
summary(anova)
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## target_response   2  0.283 0.14133   11.04 2.14e-05 ***
## Residuals       415  5.314 0.01281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the results of the ANOVA test, it seems that the unique_word_ratio plays a major role in the classification of documents (p-value < 0.05)
Visualizing Sentiment
article_data %>%
ggplot() +
geom_boxplot(aes(x=ave_sentiment, fill=target_response), outlier.colour = "red")
anova <- aov(ave_sentiment ~ target_response, data = article_data)
summary(anova)
##                  Df Sum Sq  Mean Sq F value Pr(>F)
## target_response   2  0.000 0.000119   0.007  0.993
## Residuals       415  7.071 0.017038
Based on the results, sentiment doesn’t seem to play a major role in the classification of political leanings. This goes against my expectations.
article_data %>%
ggplot() +
geom_boxplot(aes(x=trust_negatedsent, fill=target_response), outlier.colour = "red")
anova <- aov(trust_negatedsent ~ target_response, data = article_data)
summary(anova)
##                  Df   Sum Sq   Mean Sq F value Pr(>F)  
## target_response   2 0.000032 1.599e-05   3.424 0.0335 *
## Residuals       415 0.001938 4.670e-06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Emotion data may also help when classifying documents. As we can see here, articles from right-leaning sources score higher on the “trust_negated” index, which suggests that their articles express more doubt in COVID news.
Build a Corpus and conduct additional EDA
In order to use the new features that we created earlier, we will want to save them into their own dataframe. We will bind this ‘additional_features’ dataframe to the document term matrix that is produced using the tm package.
additional_features <- article_data %>%
select(
predictor_length,total_words,avg_word_size,unique_words,unique_word_ratio,
total_lemmas,unique_lemmas,unique_lemma_ratio,ave_sentiment,angersent,
anger_negatedsent,anticipationsent,anticipation_negatedsent,disgustsent,
disgust_negatedsent,fearsent,fear_negatedsent,joysent,joy_negatedsent,
sadnesssent,sadness_negatedsent,surprisesent,surprise_negatedsent,
trustsent,trust_negatedsent,target_response
)
To make these features usable by as many models as possible later on, I will normalize them with a custom min-max function.
min_max_norm <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
additional_features_norm <- as.data.frame(lapply(additional_features %>% select(-target_response), min_max_norm))
additional_features_norm$target_response <- additional_features$target_response
We will use the tm package’s stopwords, with some additional terms to ensure that we aren’t classifying documents based simply on their source name.
my_stopwords = tm::stopwords("english")
my_stopwords <- c(my_stopwords, "fox", "bbc", "apress", "associated press", "cnn", "breitbart", "vice", "buzzfeed", "washington post", "li li")
And now we build the corpus using the tm package.
corpus <- VCorpus(VectorSource(article_data$text_predictor)) %>%
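# replace patterns of exactly three digits (bounded by non-digits) with spaces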
tm_map(content_transformer(str_replace_all),
pattern = "[^\\d]\\d{3}[^\\d]",
replacement = " ") %>%
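# replace <, >, and $ characters with spaces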
tm_map(content_transformer(str_replace_all),
pattern = "[<>$]+",
replacement = " ") %>%
# replace punctuation with spaces
tm_map(content_transformer(str_replace_all),
pattern = "[[:punct:]]",
replacement = "") %>%
# collapse multiple whitespace characters into a single space
tm_map(content_transformer(str_replace_all),
pattern = "\\s+",
replacement = " ") %>%
# transform everything to lower case
tm_map(content_transformer(tolower)) %>%
# remove stopwords
tm_map(content_transformer(removeWords), my_stopwords) %>%
# lemmatize
tm_map(content_transformer(lemmatize_words))
In order to review by target_response later, we will append metadata to the documents.
for (i in 1:length(corpus)){
bias = article_data$target_response[i]
meta(corpus[[i]], "target_response") <- bias
}
left_corpus <- tm_filter(corpus, FUN = function(x) meta(x)[["target_response"]] == "Left")
right_corpus <- tm_filter(corpus, FUN = function(x) meta(x)[["target_response"]] == "Right")
center_corpus <- tm_filter(corpus, FUN = function(x) meta(x)[["target_response"]] == "Center")
Here we create a DocumentTermMatrix, and additionally create a final dataframe ‘data’ that includes the additional features we will want to test against later.
I create two versions of the same matrix, one weighted by word frequency and one by TF-IDF.
Additionally, I create a custom tokenizer that generates unigrams and bigrams (1- and 2-grams) from the predictor text for modeling.
BigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "), use.names = FALSE)
set.seed(500)
dtm <- DocumentTermMatrix(corpus, control=list(tokenize=BigramTokenizer))
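# drop sparse terms, keeping only terms that appear in at least ~4 documents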
dtm <- removeSparseTerms(dtm, 1-(4/length(corpus)))
data <- as.matrix(dtm)
data <- cbind(data, additional_features_norm)
data <- as.data.frame(data)
data$target_response <- as.factor(data$target_response)
names(data) <- make.names(names(data))
set.seed(500)
dtm_tfidf = weightTfIdf(dtm, normalize = TRUE)
dtm_tfidf <- removeSparseTerms(dtm_tfidf, 1-(4/length(corpus)))
data_tfidf <- as.matrix(dtm_tfidf)
data_tfidf <- cbind(data_tfidf, additional_features_norm)
data_tfidf <- as.data.frame(data_tfidf)
data_tfidf$target_response <- as.factor(data_tfidf$target_response)
names(data_tfidf) <- make.names(names(data_tfidf))
For some of the models we want to test, it’s important to add a character to the start of each feature name. For example, the randomForest() function does not work with column names that start with a number.
response_col <- which(colnames(data) == "target_response")
colnames(data)[-response_col] <- paste0("v", colnames(data)[-response_col])
response_col <- which(colnames(data_tfidf) == "target_response")
colnames(data_tfidf)[-response_col] <- paste0("v", colnames(data_tfidf)[-response_col])
Wordclouds
left_tdm <- TermDocumentMatrix(left_corpus, control=list(tokenize=BigramTokenizer))
left_tdm = weightTfIdf(left_tdm, normalize = TRUE)
left_tdm_matrix <- as.matrix(left_tdm)
left_words <- sort(rowSums(left_tdm_matrix),decreasing=TRUE)
left_df <- data.frame(word = names(left_words), freq = left_words)
wordcloud(words = left_df$word, freq = left_df$freq, min.freq = 1, max.words = 200, scale = c(1.5,0.25), random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
right_tdm <- TermDocumentMatrix(right_corpus, control=list(tokenize=BigramTokenizer))
right_tdm = weightTfIdf(right_tdm, normalize = TRUE)
right_tdm_matrix <- as.matrix(right_tdm)
right_words <- sort(rowSums(right_tdm_matrix),decreasing=TRUE)
right_df <- data.frame(word = names(right_words), freq = right_words)
wordcloud(words = right_df$word, freq = right_df$freq, min.freq = 1, max.words = 200, scale = c(1.5,0.25), random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
center_tdm <- TermDocumentMatrix(center_corpus, control=list(tokenize=BigramTokenizer))
center_tdm = weightTfIdf(center_tdm, normalize = TRUE)
center_tdm_matrix <- as.matrix(center_tdm)
center_words <- sort(rowSums(center_tdm_matrix),decreasing=TRUE)
center_df <- data.frame(word = names(center_words), freq = center_words)
wordcloud(words = center_df$word, freq = center_df$freq, min.freq = 1, max.words = 200, scale = c(1.2,0.25), random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
LDA Analysis - Topic Modeling
We can use the LDA function to perform topic modeling on our document term matrix. We will set an arbitrary value of 5 for the number of topics.
data_lda <- LDA(dtm, k = 5, control = list(seed=100))
data_lda
## A LDA_VEM topic model with 5 topics.
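As an optional check (not part of the original analysis), one way to make the choice of k less arbitrary is to compare model perplexity across a few candidate values, since lower perplexity generally indicates a better fit to the training documents. The candidate values below are only examples.
# fit LDA models for several candidate topic counts and compare perplexity
candidate_k <- c(3, 5, 7, 9)
lda_perplexities <- sapply(candidate_k, function(k) {
  perplexity(LDA(dtm, k = k, control = list(seed = 100)))
})
data.frame(k = candidate_k, perplexity = lda_perplexities)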
word_topics <- tidy(data_lda, matrix = "beta")
top_terms <- word_topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered()
From the above charts, we can see which words are most related to each of the generated topics following LDA. We can use our own best judgement to categorize each of the topics.
Topic 1 - work, mandates
Topic 2 - interviews
Topic 3 - sports
Topic 4 - travel
Topic 5 - children
document_gamma <- tidy(data_lda, matrix = "gamma")
document_classification <- document_gamma %>%
group_by(document) %>%
slice_max(gamma) %>%
ungroup() %>%
mutate(
document = as.integer(document)
) %>%
arrange(document)
document_classification$topic <- replace(document_classification$topic, document_classification$topic==1, "work, mandates")
document_classification$topic <- replace(document_classification$topic, document_classification$topic==2, "interviews")
document_classification$topic <- replace(document_classification$topic, document_classification$topic==3, "sports")
document_classification$topic <- replace(document_classification$topic, document_classification$topic==4, "travel")
document_classification$topic <- replace(document_classification$topic, document_classification$topic==5, "children")
topic_models_data <- cbind(article_data, document_classification %>% select(topic))
topic_models_data %>%
ggplot() +
geom_bar(aes(x=target_response, fill=target_response)) +
facet_wrap(~topic, nrow=1) +
theme(legend.position = "none") +
coord_fixed(.1)
Testing Models
I will be testing 4 models, using both word frequency and tfidf document term matrices.
dt = sort(sample(nrow(data), nrow(data)*.65))
train_data<-data[dt,]
test_data <- data[-dt,]
dt = sort(sample(nrow(data_tfidf), nrow(data_tfidf)*.65))
train_data_tfidf<-data_tfidf[dt,]
test_data_tfidf <- data_tfidf[-dt,]
In order to visualize confusion matrices and associated model scoring, I will use the following custom function.
plot_confusion <- function(cm) {
  # extract the confusion matrix counts as a data.frame
  cm_d <- as.data.frame(cm$table)
  # overall confusion matrix statistics as a data.frame
  cm_st <- data.frame(overall = cm$overall)
  # round the values
  cm_st$overall <- round(cm_st$overall, 2)
  # here we also have the rounded percentage values
  cm_p <- as.data.frame(prop.table(cm$table))
  cm_d$Perc <- round(cm_p$Freq * 100, 2)
  cm_d_p <- ggplot(data = cm_d, aes(x = Prediction, y = Reference, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = paste("", Freq)), color = 'green', size = 3) +
    theme_light() +
    guides(fill = FALSE)
  # plotting the stats
  cm_st_p <- tableGrob(cm_st)
  # all together
  grid.arrange(cm_d_p, cm_st_p, nrow = 1, ncol = 2,
               top = textGrob("Confusion Matrix and Statistics", gp = gpar(fontsize = 25, font = 1)))
}
Baseline
First I will compute the baseline accuracy and add it to our model tracker for review. This baseline is the expected accuracy of a classifier that guesses each class in proportion to its observed frequency, which works out to the sum of the squared class proportions.
article_data %>%
count(target_response) %>%
mutate(
n = (n/sum(n))**2
) %>%
select(n) %>%
sum()
## [1] 0.3471761
model_tracking <- data.frame("model" = c("baseline"), "accuracy" = .3471)
We will be using 5-fold cross-validation for all models.
train_control <- trainControl(method = "cv",
number = 5)
Regularized Logistic Regression
Word Frequency
regLog_model <- train(target_response~.,
data = train_data,
method = 'regLogistic',
trControl = train_control)
regLog_result <- predict(regLog_model, newdata = test_data %>% select(-target_response))
t <- confusionMatrix(test_data$target_response, regLog_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("reglog", accuracy))
TFIDF
regLog_tfidf_model <- train(target_response~.,
data = train_data_tfidf,
method = 'regLogistic',
trControl = train_control)
regLog_tfidf_result <- predict(regLog_tfidf_model, newdata = test_data_tfidf %>% select(-target_response))
t <- confusionMatrix(test_data_tfidf$target_response, regLog_tfidf_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("reglog_tfidf", accuracy))
Random Forest
Word Frequency
rf_model <- randomForest(target_response~., data = train_data)
rf_result <- predict(rf_model, newdata = test_data %>% select(-target_response))
t <- confusionMatrix(test_data$target_response, rf_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("randomforest", accuracy))
TFIDF
rf_tfidf_model <- randomForest(target_response~., data = train_data_tfidf)
rf_tfidf_result <- predict(rf_tfidf_model, newdata = test_data_tfidf %>% select(-target_response))
t <- confusionMatrix(test_data_tfidf$target_response, rf_tfidf_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("randomforest_tfidf", accuracy))
Bagged Logit Model
Word Frequency
bagged_logic_model <- train(target_response~.,
data = train_data,
method = 'LogitBoost',
trControl = train_control)
bagged_logic_result <- predict(bagged_logic_model, newdata = test_data %>% select(-target_response))
t <- confusionMatrix(test_data$target_response, bagged_logic_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("bagged_logit", accuracy))
TFIDF
bagged_logic_tfidf_model <- train(target_response~.,
data = train_data_tfidf,
method = 'LogitBoost',
trControl = train_control)
bagged_logic_tfidf_result <- predict(bagged_logic_tfidf_model, newdata = test_data_tfidf %>% select(-target_response))
t <- confusionMatrix(test_data_tfidf$target_response, bagged_logic_tfidf_result)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("bagged_logit_tfidf", accuracy))
Naive Bayes
Word Frequency
nb_model <- train(target_response~.,
data = train_data,
method = 'naive_bayes',
trControl = train_control)
nb_results <- predict(nb_model, newdata = test_data %>% select(-target_response))
t <- confusionMatrix(test_data$target_response, nb_results)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("naive_bayes", accuracy))
TFIDF
nb_tfidf_model <- train(target_response~.,
data = train_data_tfidf,
method = 'naive_bayes',
trControl = train_control)
nb_tfidf_results <- predict(nb_tfidf_model, newdata = test_data_tfidf %>% select(-target_response))
t <- confusionMatrix(test_data_tfidf$target_response, nb_tfidf_results)
accuracy <- t$overall[[1]]
plot_confusion(t)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
model_tracking <- rbind(model_tracking, c("naive_bayes_tfidf", accuracy))
Visualize Model Performance
# Barplot
model_tracking %>%
mutate(
accuracy = as.double(accuracy),
accuracy = round(accuracy,4)
) %>%
ggplot(aes(x=model, y=accuracy)) +
geom_bar(aes(fill=model),stat = "identity") +
geom_text(aes(label=accuracy)) +
coord_flip() +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, hjust=.5)) +
theme(legend.position = "none")