Online News Quality Assessment
1 Introduction
The biggest challenge for so-called impartial news media is removing the bias that is intrinsic to any journalist.
A journalist should always focus on the facts (positive or negative) and try, as much as possible, to set aside his or her own opinions. This task is very challenging, however, because the description of a fact will always be affected by how the journalist sees the world: his or her values, culture, and so on.
In addition, political polarization has apparently been increasing in recent years. In many cases, it is possible to identify feelings such as hate or devotion in political news.
This polarization is making mainstream news media lose credibility and creating a problem for audiences, who no longer know what to trust.
This project addresses this issue by evaluating the quality of political news using sentiment analysis and the two following assumptions.
- Assumption 1: There is no unbiased news. Journalists will always draw on their world view, values, feelings, culture, religion, etc., to describe any fact.
- Assumption 2: The best political news articles are those that consider both the positive and negative aspects of any political fact.
This project assesses the sentiment of a set of political news articles in order to evaluate their quality based on the two assumptions above. In other words, each article’s quality is graded on its balance of sentiments, which should reflect both positive and negative analyses of the political fact.
The project starts with a simple example of a fact: the Covid-19 Relief Plan recently signed by the US President.
Searching for the words “Biden covid relief plan” on Google News, I collected articles from 9 different online newspapers for the sentiment analysis, to try to capture whether each article described the fact considering both its positive and negative sides.
The ultimate goal of this project is to create a ranking of online newspapers, giving readers a tool to select the online newspapers that offer the broadest view of political facts.
PS: This project takes no political side. The objective here is to test the capability of NLP methods to identify different sentiments based on different views of the same subject.
1.1 The news articles used in this project are the following:
- https://www.cnbc.com/2021/03/11/biden-1point9-trillion-covid-relief-package-thursday-afternoon.html
- https://www.foxbusiness.com/economy/heres-how-the-10200-unemployment-tax-break-in-bidens-covid-relief-plan-works
- https://www.usatoday.com/in-depth/news/2021/03/10/covid-19-stimulus-bill-joe-bidens-plan-explained-6-graphics/4601454001/
- https://newrepublic.com/article/161838/republican-ag-lawsuits-against-american-rescue-plan-biden-covid-relief
- https://www.cnn.com/2021/03/11/politics/biden-sign-covid-bill/index.html
- https://www.bbc.com/news/world-us-canada-56355124
- https://www.washingtonpost.com/us-policy/2021/03/10/house-stimulus-biden-covid-relief-checks/
- https://www.edweek.org/policy-politics/see-what-the-huge-covid-19-aid-deal-biden-has-signed-means-for-education-in-two-charts/2021/03
- https://www.thefiscaltimes.com/2021/03/11/Tax-Hikes-Tucked-Biden-s-Covid-Relief-Plan
2 Importing the data and creating a corpus
2.0.1 Setting some initial configuration.
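The chunks below assume the following packages are loaded; the original setup chunk is not shown, so this is a minimal sketch of the required libraries.
# Packages assumed by the code below (not shown in the original report).
# On quanteda versions before 3, the textstat_*/textplot_* functions are in
# quanteda itself rather than the two companion packages.
library(readtext)            # readtext()
library(quanteda)            # corpus(), dfm(), kwic(), data_dictionary_LSD2015
library(quanteda.textstats)  # textstat_frequency(), textstat_keyness(), textstat_dist()
library(quanteda.textplots)  # textplot_wordcloud(), textplot_keyness()
library(SentimentAnalysis)   # DictionaryGI
library(dplyr)               # mutate(), filter(), arrange(), as_tibble()
library(forcats)             # fct_reorder()
library(ggplot2)             # plots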
Setting numeric output to 3 significant digits and avoiding scientific notation.
options(digits = 3, scipen = 9999)
2.1 Reading text from the news.
relief <- readtext("datasets/news/*.txt",
text_field = "text",
encoding = "UTF-8")2.2 Creating a corpus and ordering it by Types (unique words in the text)
relief.corpus <- corpus(relief)
docvars(relief.corpus, "newschannel") <-
sapply(strsplit(relief.corpus %>% names(), "\\."), "[", 1)
relief.corpus.summary <-
summary(relief.corpus) %>% arrange(desc(Types))
str(relief.corpus.summary)
## Classes 'summary.corpus' and 'data.frame': 9 obs. of 5 variables:
## $ Text : chr "thesoapbox.txt" "washingtonpost.txt" "bbc.txt" "thefiscaltimes.txt" ...
## $ Types : int 799 735 530 436 425 327 290 277 262
## $ Tokens : int 1908 1972 1180 966 940 660 577 527 539
## $ Sentences : int 55 62 44 35 38 22 18 19 24
## $ newschannel: chr "thesoapbox" "washingtonpost" "bbc" "thefiscaltimes" ...
## - attr(*, "ndoc_all")= int 9
## - attr(*, "ndoc_show")= int 9
2.3 Creating a Document Term Matrix
Creating a DTM with documents as units. This will allow comparing positive and negative sentiment within each document.
relief.dtm = dfm(
  relief.corpus,
  tolower = T,
  remove = c(stopwords("en"),
             c("biden", "american", "relief", "plan")),
  remove_punct = T,
  remove_numbers = T,
  remove_symbols = T
)
relief.dtm
## Document-feature matrix of: 9 documents, 1,874 features (81.8% sparse) and 1 docvar.
## features
## docs three reasons biden's covid bill big deal back
## bbc.txt 2 1 6 7 14 3 4 2
## cnbc.txt 0 0 0 1 4 0 0 0
## cnn.txt 1 0 3 0 3 0 0 0
## educationweek.txt 0 0 1 0 2 0 2 0
## foxbusiness.txt 0 0 2 1 0 0 0 0
## thefiscaltimes.txt 2 0 3 2 3 1 0 0
## features
## docs vice-president joe
## bbc.txt 2 1
## cnbc.txt 0 2
## cnn.txt 0 1
## educationweek.txt 0 2
## foxbusiness.txt 0 0
## thefiscaltimes.txt 0 0
## [ reached max_ndoc ... 3 more documents, reached max_nfeat ... 1,864 more features ]
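As an aside (not part of the original analysis): in quanteda version 3 and later, the cleaning options used above have moved from dfm() to tokens(). An equivalent sketch, assuming the same removals, would be:
# Equivalent DTM construction for quanteda >= 3, where lowercasing and the
# removal options are applied at the tokens() stage instead of inside dfm().
relief.dtm <- relief.corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(c(stopwords("en"), "biden", "american", "relief", "plan")) %>%
  dfm()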
3 Analyzing the News
3.1 Creating a cloud to visualize the most frequent words in all news combined
textplot_wordcloud(relief.dtm, max_words = 50, color = c("blue", "red"))
The most interesting terms in this cloud are “tax” and “bill”, which may show that the articles express some concern about how the relief plan will be paid for.
3.2 The 10 most frequently used terms
tf <- textstat_frequency(relief.dtm, n=10)
tf %>% ggplot(aes(fct_reorder(feature, frequency),
frequency,
fill = frequency)
) +
geom_col() +
coord_flip() +
theme(axis.text.y
= element_text(size=rel(2),
angle = 45,
vjust = 1,
hjust = 1))
3.3 Comparing corpora
Comparing frequencies
names <- docvars(relief.dtm)[["newschannel"]]
for (unit in seq_along(names)){
  presence = docvars(relief.dtm)[["newschannel"]] == names[unit]
  ts = textstat_keyness(relief.dtm, presence)
  t = textplot_keyness(ts, n = 10)
  plot(t)
}
These comparisons show some differences among the articles. Three groups can be observed:
bbc, cnbc, cnn, educationweek, and usatodaynews are more interested in the positive effects of the relief plan:
- bbc: bill, covid, support
- cnbc: vaccination, parade, legislation
- cnn: covid, vaccines
- educationweek: education, recovery, stabilization
- usatodaynews: small business, benefits, obamacare
foxbusiness and thefiscaltimes are more interested in the possible negative side effects of the relief plan:
- foxbusiness: taxable, taxes, unemployment
- thefiscaltimes: tax, increases, pay
thesoapbox and washingtonpost are more interested in the legal aspects of the relief plan:
- thesoapbox: amendment, decision, attorney, court, justice
- washingtonpost: house, lawmakers, party
3.4 Testing Similarity
tstat_dist <- as.dist(textstat_dist(relief.dtm))
relief.clust <- hclust(tstat_dist)
plot(relief.clust)
This test was not very informative because it doesn’t clearly show distinct groups.
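One possible refinement (an assumption on my part, not part of the original analysis) is to compute the distances on relative term frequencies so that article length does not dominate the clustering. A sketch, reusing relief.dtm:
# Hypothetical refinement: weight the DTM by relative term frequency before
# computing distances, so longer articles do not dominate the clustering.
relief.dtm.prop <- dfm_weight(relief.dtm, scheme = "prop")
tstat_dist_prop <- as.dist(textstat_dist(relief.dtm.prop))
plot(hclust(tstat_dist_prop))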
4 Analysis of Sentiment
4.1 Comparing two dictionaries
res_dict_quanteda = relief.dtm %>%
dfm_lookup(data_dictionary_LSD2015[1:2]) %>%
convert(to = "data.frame") %>%
as_tibble
res_dict_quanteda = res_dict_quanteda %>% mutate(length=ntoken(relief.dtm))
res_dict_quanteda
## # A tibble: 9 x 4
## doc_id negative positive length
## <chr> <dbl> <dbl> <int>
## 1 bbc.txt 33 62 575
## 2 cnbc.txt 10 25 281
## 3 cnn.txt 7 16 323
## 4 educationweek.txt 5 44 294
## 5 foxbusiness.txt 11 16 272
## 6 thefiscaltimes.txt 15 22 459
## 7 thesoapbox.txt 66 60 966
## 8 usatodaynews.txt 9 58 454
## 9 washingtonpost.txt 42 79 992
dictGI = dictionary(DictionaryGI)
res_dict_GI = relief.dtm %>%
dfm_lookup(dictGI) %>%
convert(to = "data.frame") %>%
as_tibble
res_dict_GI = res_dict_GI %>% mutate(length=ntoken(relief.dtm))
res_dict_GI
## # A tibble: 9 x 4
## doc_id negative positive length
## <chr> <dbl> <dbl> <int>
## 1 bbc.txt 41 74 575
## 2 cnbc.txt 14 21 281
## 3 cnn.txt 13 19 323
## 4 educationweek.txt 9 37 294
## 5 foxbusiness.txt 25 14 272
## 6 thefiscaltimes.txt 43 38 459
## 7 thesoapbox.txt 64 107 966
## 8 usatodaynews.txt 19 56 454
## 9 washingtonpost.txt 51 87 992
The DictionaryGI dictionary shows better results because it captures more positive and negative words, so I will proceed with DictionaryGI.
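A quick way to quantify this, using the two tibbles built above:
# Total dictionary hits per lexicon, summed over the 9 articles;
# DictionaryGI matches more terms than LSD2015 on this corpus.
sum(res_dict_quanteda$positive + res_dict_quanteda$negative)  # LSD2015
sum(res_dict_GI$positive + res_dict_GI$negative)              # DictionaryGI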
4.2 Analyzing the most frequent terms (positives and negatives)
freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$positive)
## # A tibble: 147 x 5
## feature frequency rank docfreq group
## <chr> <dbl> <int> <dbl> <chr>
## 1 rescue 24 10 8 all
## 2 aid 23 13 6 all
## 3 help 22 15 7 all
## 4 law 22 15 5 all
## 5 credit 20 21 6 all
## 6 support 15 36 4 all
## 7 education 10 57 2 all
## 8 make 9 63 4 all
## 9 deal 8 74 3 all
## 10 pay 8 74 3 all
## # ... with 137 more rows
4.3 Removing some neutral terms from the dictionary
as.data.frame(kwic(relief.corpus, "make")) %>%
transmute(make = paste(pre, " < ", keyword, " > ", post))
## make
## 1 will also boost provisions to < make > health care more affordable and
## 2 clear that they want to < make > permanent key elements , like
## 3 and say they want to < make > sure that additional spending is
## 4 to twice that - would < make > it impossible to pay for
## 5 executive only the responsibility to < make > … factual findings … and
## 6 observed , that formula would < make > " most of Government …
## 7 hyperbolic . Federal agencies routinely < make > " policy judgments " to
## 8 testing , modifying classrooms to < make > them safer , improving ventilation
## 9 expands eligibility to families who < make > no or very little income
as.data.frame(kwic(relief.corpus, "deal")) %>%
transmute(deal = paste(pre, " < ", keyword, " > ", post))
## deal
## 1 Covid bill is a big < deal > Back in 2010 , then
## 2 to emphasise how big a < deal > he thought congressional passage of
## 3 has his own big congressional < deal > - a $ 1.9tn (
## 4 or Franklin Roosevelt's Depression-era New < Deal > programmes in size and scope
## 5 What the Huge COVID-19 Aid < Deal > Biden Has Signed Means for
## 6 to help students and educators < deal > with the various impacts of
## 7 " the terms of the < deal > . In Senate testimony and
## 8 mandates . Since the New < Deal > , this doctrine has been
as.data.frame(kwic(relief.corpus, "help")) %>%
transmute(help = paste(pre, " < ", keyword, " > ", post))
## help
## 1 which the legislation will also < help > fund - the US could
## 2 and expand tax credits to < help > businesses keep employees on the
## 3 dollars for K-12 schools to < help > students return to the classroom
## 4 approximately $ 129 billion to < help > students and educators deal with
## 5 tax increases Democrats picked to < help > keep their plan's cost in
## 6 that could be proposed to < help > pay for those future plans
## 7 , extend unemployment benefits and < help > reopen schools . Small businesses
## 8 capacity ; improving technology to < help > disadvantaged students , and providing
## 9 Plan continues earlier efforts to < help > small - but key -
## 10 the Paycheck Protection Program to < help > small business . This builds
## 11 The money is designed to < help > small landlords as well .
## 12 17 for one year to < help > combat the economic damage of
## 13 increase in anti-poverty programs to < help > millions of families still struggling
## 14 bill approves additional money to < help > schools reopen , allow restaurants
## 15 doubles as an attempt to < help > Americans who were struggling long
## 16 promised relief , and now < help > is on the way ,
## 17 expansion to federal programs that < help > Americans afford food in the
## 18 $ 7 billion effort to < help > students obtain Internet access .
## 19 start seeing some of the < help > show up in her bank
## 20 1,400 stimulus check , would < help > her cover much-needed car repairs
## 21 is going to do is < help > people catch up , "
## 22 significant relief that promises to < help > families amid the pandemic .
The words “make”, “deal”, and “help” appear in both the negative and positive parts of the dictionary and have high frequencies, so I decided to take a closer look at them.
Analyzing the sentences in which they are used (the keyword-in-context output above), “make” and “deal” are neutral terms (neither positive nor negative), so I will remove them from both parts of the dictionary.
The word “help” has a clear positive sentiment, so I will remove it from the negative part of the dictionary only.
For reference, the most frequent negative terms before these removals are shown below.
freqs %>% as_tibble() %>% filter(feature %in% dictGI$negative)
## # A tibble: 116 x 5
## feature frequency rank docfreq group
## <chr> <dbl> <int> <dbl> <chr>
## 1 tax 53 1 8 all
## 2 help 22 15 7 all
## 3 poverty 9 63 4 all
## 4 make 9 63 4 all
## 5 deal 8 74 3 all
## 6 cut 8 74 4 all
## 7 need 7 97 4 all
## 8 break 7 97 3 all
## 9 get 6 124 4 all
## 10 even 5 160 4 all
## # ... with 106 more rows
So, to try to improve the quality of this analysis, I will remove the three words from the dictionary as described above. PS: a word can belong to more than one group; this possibility is the main characteristic that differentiates this type of analysis from cluster analysis. However, the three words chosen here are neutral rather than positive or negative.
dictGI$positive <- setdiff(dictGI$positive, c("make","deal"))
dictGI$negative <- setdiff(dictGI$negative, c("make","deal","help"))
Analyzing again the most frequent positive terms
freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$positive)
## # A tibble: 145 x 5
## feature frequency rank docfreq group
## <chr> <dbl> <int> <dbl> <chr>
## 1 rescue 24 10 8 all
## 2 aid 23 13 6 all
## 3 help 22 15 7 all
## 4 law 22 15 5 all
## 5 credit 20 21 6 all
## 6 support 15 36 4 all
## 7 education 10 57 2 all
## 8 pay 8 74 3 all
## 9 back 7 97 3 all
## 10 major 7 97 4 all
## # ... with 135 more rows
and negatives
freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$negative)
## # A tibble: 113 x 5
## feature frequency rank docfreq group
## <chr> <dbl> <int> <dbl> <chr>
## 1 tax 53 1 8 all
## 2 poverty 9 63 4 all
## 3 cut 8 74 4 all
## 4 need 7 97 4 all
## 5 break 7 97 3 all
## 6 get 6 124 4 all
## 7 even 5 160 4 all
## 8 inflation 3 289 2 all
## 9 hit 3 289 3 all
## 10 poor 3 289 2 all
## # ... with 103 more rows
Combining the dtm with the updated dictionary
res_dict_GI = relief.dtm %>%
dfm_lookup(dictGI) %>%
convert(to = "data.frame") %>%
as_tibble
res_dict_GI = res_dict_GI %>% mutate(length=ntoken(relief.dtm))
res_dict_GI = res_dict_GI %>% mutate(news=docvars(relief.corpus, "newschannel"))
res_dict_GI
## # A tibble: 9 x 5
## doc_id negative positive length news
## <chr> <dbl> <dbl> <int> <chr>
## 1 bbc.txt 36 70 575 bbc
## 2 cnbc.txt 12 20 281 cnbc
## 3 cnn.txt 12 19 323 cnn
## 4 educationweek.txt 6 35 294 educationweek
## 5 foxbusiness.txt 25 14 272 foxbusiness
## 6 thefiscaltimes.txt 38 35 459 thefiscaltimes
## 7 thesoapbox.txt 59 102 966 thesoapbox
## 8 usatodaynews.txt 11 54 454 usatodaynews
## 9 washingtonpost.txt 41 87 992 washingtonpost
The 10 most frequent positive and negative terms are now more clearly distinct from each other.
4.4 Creating some ratios to compare the overall differences among news
res_dict_GI = res_dict_GI %>% # Better when closer to zero.
mutate(sentiment_only=(positive - negative) / (positive + negative))
res_dict_GI = res_dict_GI %>% # The higher the better.
mutate(dict_compatibility=(positive + negative) / length)
res_dict_GI = res_dict_GI %>% # Combine both ratios above. The higher the better
mutate(overall_ratio = dict_compatibility/abs(sentiment_only))
res_dict_GI
## # A tibble: 9 x 8
## doc_id negative positive length news sentiment_only dict_compatibil~
## <chr> <dbl> <dbl> <int> <chr> <dbl> <dbl>
## 1 bbc.txt 36 70 575 bbc 0.321 0.184
## 2 cnbc.txt 12 20 281 cnbc 0.25 0.114
## 3 cnn.txt 12 19 323 cnn 0.226 0.0960
## 4 educationw~ 6 35 294 educatio~ 0.707 0.139
## 5 foxbusines~ 25 14 272 foxbusin~ -0.282 0.143
## 6 thefiscalt~ 38 35 459 thefisca~ -0.0411 0.159
## 7 thesoapbox~ 59 102 966 thesoapb~ 0.267 0.167
## 8 usatodayne~ 11 54 454 usatoday~ 0.662 0.143
## 9 washington~ 41 87 992 washingt~ 0.359 0.129
## # ... with 1 more variable: overall_ratio <dbl>
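As a sanity check, the ratios for bbc.txt (positive = 70, negative = 36, length = 575 in the table above) work out as follows:
# Worked example for bbc.txt; the overall_ratio column is truncated in the
# printed tibble above.
(70 - 36) / (70 + 36)                           # sentiment_only     ~ 0.321
(70 + 36) / 575                                 # dict_compatibility ~ 0.184
((70 + 36) / 575) / abs((70 - 36) / (70 + 36))  # overall_ratio      ~ 0.575
Note that overall_ratio grows quickly as sentiment_only approaches zero, which is what pushes well-balanced articles to the top of the final ranking.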
5 Result Analysis and Conclusion
5.1 Overall Sentiment of each News
res_dict_GI %>%
ggplot(aes(x = news, y = sentiment_only, fill = sentiment_only)) +
geom_col() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(size=rel(2), angle = 45, vjust = 1, hjust = 1))
In this plot, thefiscaltimes, followed by cnn and cnbc, has the best results because these are the ones closest to zero, which indicates a good balance of sentiment.
foxbusiness and thefiscaltimes are the only ones with an overall negative sentiment towards the relief plan, and educationweek and usatodaynews are the ones with the strongest positive sentiment towards it. However, being positive or negative in this graph only indicates the overall sentiment of the article. educationweek may contain some very strong criticisms, but most of its text expresses a positive sentiment about the relief plan, so it scored very poorly here.
5.2 Variance of Sentiment
res_dict_GI %>%
ggplot(aes(fct_reorder(news,dict_compatibility) , y = dict_compatibility, fill = dict_compatibility)) +
geom_col() +
theme(legend.position = "none") +
theme(axis.text.x = element_text(size=rel(2), angle = 45, vjust = 1, hjust = 1))
This ratio indicates how much variance there is between sentiments (positive vs. negative). In this case, the higher the value the better, since it indicates that the article mentioned topics on opposite sides. cnn scored very poorly here because its value was low. This low value is also related to how well the article matches the dictionary used: cnn may have used some unusual positive and negative terms that were not captured by the dictionary applied here. To increase the reliability of these results, it would be necessary to compare these articles against other dictionaries.
5.3 Final Rank of News
res_dict_GI %>%
ggplot(aes(fct_reorder(news,overall_ratio), overall_ratio, fill = overall_ratio)) +
geom_col() +
coord_flip() +
theme(legend.position = "none") +
theme(axis.text.y = element_text(size=rel(2), angle = 45),
axis.text.x = element_text(size=rel(2)))
The final rank is a combination of the overall sentiment with the variance of sentiment; in this case, the higher, the better.
thefiscaltimes had the best score in overall sentiment and was third in variance. These results put thefiscaltimes in first place, with a good combination of sentiment balance and high variance of topics.
5.4 Directions for Future Improvements
This project’s ultimate goal is to create a ranking of online newspapers, giving readers a tool to select the online newspapers that offer the broadest view of political facts.
To enable the online newspaper ranking, I would consider four main tasks.
The first is creating code that automatically finds and downloads news articles from the internet, based on keywords.
The second is creating a list of significant political facts and their respective keywords, used to find the related articles.
The third is creating one dictionary, based on a combination of already existing dictionaries, to increase the reliability of this project.
Finally, the fourth is combining all the results to form a reliable ranking of political news.
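As a starting point for the first task, a minimal sketch, assuming the rvest package and a hypothetical keyword-based list of URLs (the helper download_article is illustrative only):
library(rvest)

# Hypothetical helper: download one article and keep only the paragraph text.
# Real news sites differ in markup, so per-site extraction rules would be needed.
download_article <- function(url) {
  page <- read_html(url)
  paste(html_text2(html_elements(page, "p")), collapse = "\n")
}

# Example with one of the URLs listed in section 1.1.
urls <- c("https://www.bbc.com/news/world-us-canada-56355124")
texts <- vapply(urls, download_article, character(1))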
6 References
- R files of Professor Armando Rodriguez, University of New Haven.
- https://www.youtube.com/watch?v=U0l5GB0i3uU
- https://github.com/ccs-amsterdam/r-course-material/blob/master/tutorials/sentiment_analysis.md