Online News Quality Assessment

1 Introduction

The biggest challenge for so-called impartial news media is removing the bias that is intrinsic to any journalist.

Journalists should always focus on the facts (positive or negative) and try, as much as possible, to keep their own opinions out of the reporting. This task is very challenging, however, because the description of a fact will always be affected by how the journalist sees the world: his or her values, culture, and so on.

In addition, political polarization has apparently been increasing in recent years. In many cases, it is possible to identify feelings such as hate or devotion in political news.

This polarization is making mainstream news media lose credibility and creates a problem for the audience, which no longer knows what to trust.

This project will address this issue by evaluating the quality of political news using sentiment analysis and the two following assumptions.

  • Assumption 1: There is no unbiased news. Journalists will always use their world view, values, feelings, culture, religion, etc., to describe any fact.
  • Assumption 2: The best political news considers both the positive and negative aspects of any political fact.

This project aims to assess the sentiment level of a set of political news articles and evaluate their quality based on the two assumptions above. In other words, the quality of each article will be graded on its balance of sentiments, which should reflect both positive and negative analysis of the political fact.

This project will start with a simple example of a fact: the Covid-19 Relief Plan recently signed into law by the US President.

Searching for the words “Biden covid relief plan” on Google News, I collected articles from 9 different online newspapers for the sentiment analysis, trying to capture whether each article described the fact considering both its positive and negative sides.

The ultimate goal of this project is to create a ranking of online newspapers, giving the reader a tool to select the outlets that offer the broadest view of political facts.

PS: This project takes no political side. The objective here is to test the capability of NLP methods to identify different sentiments based on different views of the same subject.

2 Importing the data and creating a corpus

2.0.1 Setting some initial configuration.

Setting numeric output to 3 significant digits and avoiding scientific notation.

options(digits = 3, scipen = 9999)
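
The code below assumes that the required packages have been loaded earlier in the workflow; the loading chunk is not shown in this report, so the list that follows is an assumption. Depending on the quanteda version, the textstat_* and textplot_* functions may live in the companion packages listed here rather than in quanteda itself.

library(readtext)            # readtext()
library(quanteda)            # corpus(), dfm(), kwic(), data_dictionary_LSD2015
library(quanteda.textstats)  # textstat_frequency(), textstat_keyness(), textstat_dist()
library(quanteda.textplots)  # textplot_wordcloud(), textplot_keyness()
library(SentimentAnalysis)   # assumed source of DictionaryGI
library(tidyverse)           # dplyr, ggplot2, forcats, tibble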

2.1 Reading text from the news.

relief <- readtext("datasets/news/*.txt",
                   text_field = "text",
                   encoding = "UTF-8")

2.2 Creating a corpus and ordering it by Types (unique words in the text)

relief.corpus <- corpus(relief)
docvars(relief.corpus, "newschannel") <-
  sapply(strsplit(relief.corpus %>% names(), "\\."), "[", 1)
relief.corpus.summary <-
  summary(relief.corpus) %>% arrange(desc(Types))
str(relief.corpus.summary)
## Classes 'summary.corpus' and 'data.frame':   9 obs. of  5 variables:
##  $ Text       : chr  "thesoapbox.txt" "washingtonpost.txt" "bbc.txt" "thefiscaltimes.txt" ...
##  $ Types      : int  799 735 530 436 425 327 290 277 262
##  $ Tokens     : int  1908 1972 1180 966 940 660 577 527 539
##  $ Sentences  : int  55 62 44 35 38 22 18 19 24
##  $ newschannel: chr  "thesoapbox" "washingtonpost" "bbc" "thefiscaltimes" ...
##  - attr(*, "ndoc_all")= int 9
##  - attr(*, "ndoc_show")= int 9

2.3 Creating a Document Term Matrix

Creating a DTM using documents as units. This will allow the comparison of positive and negative sentiment in each doc.

relief.dtm = dfm(
  relief.corpus,
  tolower = TRUE,
  remove = c(stopwords("en"), "biden", "american", "relief", "plan"),
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)
relief.dtm
## Document-feature matrix of: 9 documents, 1,874 features (81.8% sparse) and 1 docvar.
##                     features
## docs                 three reasons biden's covid bill big deal back
##   bbc.txt                2       1       6     7   14   3    4    2
##   cnbc.txt               0       0       0     1    4   0    0    0
##   cnn.txt                1       0       3     0    3   0    0    0
##   educationweek.txt      0       0       1     0    2   0    2    0
##   foxbusiness.txt        0       0       2     1    0   0    0    0
##   thefiscaltimes.txt     2       0       3     2    3   1    0    0
##                     features
## docs                 vice-president joe
##   bbc.txt                         2   1
##   cnbc.txt                        0   2
##   cnn.txt                         0   1
##   educationweek.txt               0   2
##   foxbusiness.txt                 0   0
##   thefiscaltimes.txt              0   0
## [ reached max_ndoc ... 3 more documents, reached max_nfeat ... 1,864 more features ]

3 Analyzing the News

3.1 Creating a word cloud to visualize the most frequent words in all the articles combined

textplot_wordcloud(relief.dtm, max_words = 50, color = c("blue", "red"))

The most interesting terms in this cloud are “tax” and “bill”, which may indicate that the articles express some concern about how the relief plan will be paid for.

3.2 Plotting the 10 most frequent terms

tf <- textstat_frequency(relief.dtm, n=10)

tf %>% ggplot(aes(fct_reorder(feature, frequency),
                  frequency,
                  fill = frequency)
              ) +
  geom_col() + 
  coord_flip() +
  theme(axis.text.y
        = element_text(size=rel(2),
                       angle = 45,
                       vjust = 1,
                       hjust = 1))

3.3 Comparing corpora

Comparing frequencies

# Compare each outlet against all the others using keyness statistics
channels <- docvars(relief.dtm)[["newschannel"]]
for (unit in seq_along(channels)){
    presence = docvars(relief.dtm)[["newschannel"]] == channels[unit]  # target = current outlet
    ts = textstat_keyness(relief.dtm, presence)
    p = textplot_keyness(ts, n = 10)
    plot(p)
}

These comparisons show some differences among the outlets. Three groups can be observed.

  • bbc, cnbc, cnn, educationweek, and usatodaynews are more interested in the positive effects of the relief plan

    • bbc - bill covid support

    • cnbc - vaccination, parade, legislation

    • cnn - covid vaccines

    • educationweek - education recovery stabilization

    • usatodaynews - small business benefits obamacare

  • foxbusiness and thefiscaltimes are more interested in the possible negative side effects of the relief plan

    • foxbusiness - taxable taxes unemployment

    • thefiscaltimes - tax increases pay

  • thesoapbox and washingtonpost are more interested in the legal aspects of the relief plan

    • thesoapbox - amendment, decision, attorney, court, justice

    • washingtonpost - house, lawmakers, party

3.4 Testing Similarity

tstat_dist <- as.dist(textstat_dist(relief.dtm))
relief.clust <- hclust(tstat_dist)
plot(relief.clust)

This test was not very informative, because the dendrogram does not show clearly distinct groups.
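
One possible reason is that raw term counts are dominated by document length. A minimal sketch of an alternative (an assumption, not part of the original analysis) would be to convert counts to relative frequencies before computing distances:

# Sketch: weight the dfm by relative term frequency before clustering,
# so that longer articles do not dominate the distance measure.
relief.dtm.prop <- dfm_weight(relief.dtm, scheme = "prop")
tstat_dist_prop <- as.dist(textstat_dist(relief.dtm.prop))
plot(hclust(tstat_dist_prop))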

4 Analysis of Sentiment

4.1 Comparing two dictionaries

res_dict_quanteda = relief.dtm %>%
  dfm_lookup(data_dictionary_LSD2015[1:2]) %>% 
   convert(to = "data.frame") %>% 
  as_tibble

res_dict_quanteda = res_dict_quanteda %>% mutate(length=ntoken(relief.dtm))
res_dict_quanteda
## # A tibble: 9 x 4
##   doc_id             negative positive length
##   <chr>                 <dbl>    <dbl>  <int>
## 1 bbc.txt                  33       62    575
## 2 cnbc.txt                 10       25    281
## 3 cnn.txt                   7       16    323
## 4 educationweek.txt         5       44    294
## 5 foxbusiness.txt          11       16    272
## 6 thefiscaltimes.txt       15       22    459
## 7 thesoapbox.txt           66       60    966
## 8 usatodaynews.txt          9       58    454
## 9 washingtonpost.txt       42       79    992
dictGI = dictionary(DictionaryGI)
res_dict_GI = relief.dtm %>% 
  dfm_lookup(dictGI) %>% 
  convert(to = "data.frame") %>% 
  as_tibble

res_dict_GI = res_dict_GI %>% mutate(length=ntoken(relief.dtm))
res_dict_GI
## # A tibble: 9 x 4
##   doc_id             negative positive length
##   <chr>                 <dbl>    <dbl>  <int>
## 1 bbc.txt                  41       74    575
## 2 cnbc.txt                 14       21    281
## 3 cnn.txt                  13       19    323
## 4 educationweek.txt         9       37    294
## 5 foxbusiness.txt          25       14    272
## 6 thefiscaltimes.txt       43       38    459
## 7 thesoapbox.txt           64      107    966
## 8 usatodaynews.txt         19       56    454
## 9 washingtonpost.txt       51       87    992

The DictionaryGI dictionary shows better results because it captures more positive and negative words, so I am going to proceed with DictionaryGI.
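
To make this comparison more concrete, the share of tokens matched by each dictionary can be computed from the two result tables above (a quick check, not part of the original workflow):

# Share of tokens in each article matched by each dictionary
tibble(
  doc_id       = res_dict_GI$doc_id,
  lsd_coverage = (res_dict_quanteda$positive + res_dict_quanteda$negative) /
                 res_dict_quanteda$length,
  gi_coverage  = (res_dict_GI$positive + res_dict_GI$negative) /
                 res_dict_GI$length
)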

4.2 Analyzing the most frequent terms (positives and negatives)

freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$positive)
## # A tibble: 147 x 5
##    feature   frequency  rank docfreq group
##    <chr>         <dbl> <int>   <dbl> <chr>
##  1 rescue           24    10       8 all  
##  2 aid              23    13       6 all  
##  3 help             22    15       7 all  
##  4 law              22    15       5 all  
##  5 credit           20    21       6 all  
##  6 support          15    36       4 all  
##  7 education        10    57       2 all  
##  8 make              9    63       4 all  
##  9 deal              8    74       3 all  
## 10 pay               8    74       3 all  
## # ... with 137 more rows

4.3 Removing some neutral terms from the dictionary

as.data.frame(kwic(relief.corpus, "make")) %>% 
  transmute(make = paste(pre, " < ", keyword, " > ", post)) 
##                                                                                make
## 1        will also boost provisions to  <  make  >  health care more affordable and
## 2                clear that they want to  <  make  >  permanent key elements , like
## 3                and say they want to  <  make  >  sure that additional spending is
## 4                       to twice that - would  <  make  >  it impossible to pay for
## 5        executive only the responsibility to  <  make  >  … factual findings … and
## 6                 observed , that formula would  <  make  >  " most of Government …
## 7      hyperbolic . Federal agencies routinely  <  make  >  " policy judgments " to
## 8 testing , modifying classrooms to  <  make  >  them safer , improving ventilation
## 9         expands eligibility to families who  <  make  >  no or very little income
as.data.frame(kwic(relief.corpus, "deal")) %>% 
  transmute(deal = paste(pre, " < ", keyword, " > ", post))
##                                                                                   deal
## 1                                 Covid bill is a big  <  deal  >  Back in 2010 , then
## 2              to emphasise how big a  <  deal  >  he thought congressional passage of
## 3                             has his own big congressional  <  deal  >  - a $ 1.9tn (
## 4 or Franklin Roosevelt's Depression-era New  <  Deal  >  programmes in size and scope
## 5                   What the Huge COVID-19 Aid  <  Deal  >  Biden Has Signed Means for
## 6              to help students and educators  <  deal  >  with the various impacts of
## 7                            " the terms of the  <  deal  >  . In Senate testimony and
## 8                       mandates . Since the New  <  Deal  >  , this doctrine has been
as.data.frame(kwic(relief.corpus, "help")) %>% 
  transmute(help = paste(pre, " < ", keyword, " > ", post))
##                                                                                       help
## 1                         which the legislation will also  <  help  >  fund - the US could
## 2                  and expand tax credits to  <  help  >  businesses keep employees on the
## 3                dollars for K-12 schools to  <  help  >  students return to the classroom
## 4             approximately $ 129 billion to  <  help  >  students and educators deal with
## 5                 tax increases Democrats picked to  <  help  >  keep their plan's cost in
## 6                        that could be proposed to  <  help  >  pay for those future plans
## 7        , extend unemployment benefits and  <  help  >  reopen schools . Small businesses
## 8   capacity ; improving technology to  <  help  >  disadvantaged students , and providing
## 9                         Plan continues earlier efforts to  <  help  >  small - but key -
## 10            the Paycheck Protection Program to  <  help  >  small business . This builds
## 11                         The money is designed to  <  help  >  small landlords as well .
## 12                           17 for one year to  <  help  >  combat the economic damage of
## 13 increase in anti-poverty programs to  <  help  >  millions of families still struggling
## 14       bill approves additional money to  <  help  >  schools reopen , allow restaurants
## 15                doubles as an attempt to  <  help  >  Americans who were struggling long
## 16                                  promised relief , and now  <  help  >  is on the way ,
## 17            expansion to federal programs that  <  help  >  Americans afford food in the
## 18                    $ 7 billion effort to  <  help  >  students obtain Internet access .
## 19                               start seeing some of the  <  help  >  show up in her bank
## 20             1,400 stimulus check , would  <  help  >  her cover much-needed car repairs
## 21                                      is going to do is  <  help  >  people catch up , "
## 22           significant relief that promises to  <  help  >  families amid the pandemic .

The words “make”, “deal” and “help” appear in both the positive and the negative parts of the dictionary and with high frequencies, so I decided to take a closer look at them by analyzing, above, the sentences in which they are used.

The words “make” and “deal” are neutral terms (neither positive nor negative), so I will remove them from both parts of the dictionary.

The word “help” has a clear positive sentiment, so I will remove it from the negative part of the dictionary only.

For reference, these are the most frequent terms in the negative part of the dictionary before the removal:

freqs %>% as_tibble() %>% filter(feature %in% dictGI$negative)
## # A tibble: 116 x 5
##    feature frequency  rank docfreq group
##    <chr>       <dbl> <int>   <dbl> <chr>
##  1 tax            53     1       8 all  
##  2 help           22    15       7 all  
##  3 poverty         9    63       4 all  
##  4 make            9    63       4 all  
##  5 deal            8    74       3 all  
##  6 cut             8    74       4 all  
##  7 need            7    97       4 all  
##  8 break           7    97       3 all  
##  9 get             6   124       4 all  
## 10 even            5   160       4 all  
## # ... with 106 more rows

So, to try to improve the quality of this analysis, I will remove the three words from the dictionary. PS: A word can belong to more than one group; this possibility is the main characteristic that differentiates this type of analysis from cluster analysis. However, the three words chosen here are neutral rather than clearly positive or negative.

dictGI$positive <- setdiff(dictGI$positive, c("make","deal"))
dictGI$negative <- setdiff(dictGI$negative, c("make","deal","help"))
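
A quick check (not part of the original analysis) confirms that the terms were dropped from the intended parts of the dictionary:

# Should all be FALSE: the neutral terms are gone from these lists
c("make", "deal", "help") %in% dictGI$negative
c("make", "deal") %in% dictGI$positive
"help" %in% dictGI$positive   # should remain TRUE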

Analyzing again the most frequent positive terms

freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$positive)
## # A tibble: 145 x 5
##    feature   frequency  rank docfreq group
##    <chr>         <dbl> <int>   <dbl> <chr>
##  1 rescue           24    10       8 all  
##  2 aid              23    13       6 all  
##  3 help             22    15       7 all  
##  4 law              22    15       5 all  
##  5 credit           20    21       6 all  
##  6 support          15    36       4 all  
##  7 education        10    57       2 all  
##  8 pay               8    74       3 all  
##  9 back              7    97       3 all  
## 10 major             7    97       4 all  
## # ... with 135 more rows

and the most frequent negative terms

freqs = textstat_frequency(relief.dtm)
freqs %>% as_tibble() %>% filter(feature %in% dictGI$negative)
## # A tibble: 113 x 5
##    feature   frequency  rank docfreq group
##    <chr>         <dbl> <int>   <dbl> <chr>
##  1 tax              53     1       8 all  
##  2 poverty           9    63       4 all  
##  3 cut               8    74       4 all  
##  4 need              7    97       4 all  
##  5 break             7    97       3 all  
##  6 get               6   124       4 all  
##  7 even              5   160       4 all  
##  8 inflation         3   289       2 all  
##  9 hit               3   289       3 all  
## 10 poor              3   289       2 all  
## # ... with 103 more rows

Combining the dtm with the updated dictionary

res_dict_GI = relief.dtm %>% 
  dfm_lookup(dictGI) %>% 
  convert(to = "data.frame") %>% 
  as_tibble

res_dict_GI = res_dict_GI %>% mutate(length=ntoken(relief.dtm))
res_dict_GI = res_dict_GI %>% mutate(news=docvars(relief.corpus, "newschannel"))
res_dict_GI
## # A tibble: 9 x 5
##   doc_id             negative positive length news          
##   <chr>                 <dbl>    <dbl>  <int> <chr>         
## 1 bbc.txt                  36       70    575 bbc           
## 2 cnbc.txt                 12       20    281 cnbc          
## 3 cnn.txt                  12       19    323 cnn           
## 4 educationweek.txt         6       35    294 educationweek 
## 5 foxbusiness.txt          25       14    272 foxbusiness   
## 6 thefiscaltimes.txt       38       35    459 thefiscaltimes
## 7 thesoapbox.txt           59      102    966 thesoapbox    
## 8 usatodaynews.txt         11       54    454 usatodaynews  
## 9 washingtonpost.txt       41       87    992 washingtonpost

With the neutral terms removed, the 10 most frequent positive and negative terms no longer overlap.

4.4 Creating some ratios to compare the overall differences among news

res_dict_GI = res_dict_GI %>% # Better when closer to zero.
  mutate(sentiment_only=(positive - negative) / (positive + negative)) 

res_dict_GI = res_dict_GI %>% # The higher the better.
  mutate(dict_compatibility=(positive + negative) / length) 

res_dict_GI = res_dict_GI %>% # Combine both ratios above. The higher the better
  mutate(overall_ratio = dict_compatibility/abs(sentiment_only)) 

res_dict_GI
## # A tibble: 9 x 8
##   doc_id      negative positive length news      sentiment_only dict_compatibil~
##   <chr>          <dbl>    <dbl>  <int> <chr>              <dbl>            <dbl>
## 1 bbc.txt           36       70    575 bbc               0.321            0.184 
## 2 cnbc.txt          12       20    281 cnbc              0.25             0.114 
## 3 cnn.txt           12       19    323 cnn               0.226            0.0960
## 4 educationw~        6       35    294 educatio~         0.707            0.139 
## 5 foxbusines~       25       14    272 foxbusin~        -0.282            0.143 
## 6 thefiscalt~       38       35    459 thefisca~        -0.0411           0.159 
## 7 thesoapbox~       59      102    966 thesoapb~         0.267            0.167 
## 8 usatodayne~       11       54    454 usatoday~         0.662            0.143 
## 9 washington~       41       87    992 washingt~         0.359            0.129 
## # ... with 1 more variable: overall_ratio <dbl>
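
As a sanity check, the three ratios can be reproduced by hand for bbc.txt using the counts in the table above (70 positive, 36 negative, 575 tokens):

# Manual check of the ratios for bbc.txt
(70 - 36) / (70 + 36)                            # sentiment_only     ~ 0.321
(70 + 36) / 575                                  # dict_compatibility ~ 0.184
((70 + 36) / 575) / abs((70 - 36) / (70 + 36))   # overall_ratio      ~ 0.57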

5 Result Analysis and Conclusion

5.1 Overall Sentiment of Each News Outlet

res_dict_GI %>%
  ggplot(aes(x = news, y = sentiment_only, fill = sentiment_only)) +
  geom_col() +
  theme(legend.position = "none") + 
  theme(axis.text.x = element_text(size=rel(2), angle = 45, vjust = 1, hjust = 1))

In this plot, thefiscaltimes, followed by cnn and cnbc, has the best results because these are the outlets closest to zero, which indicates a good balance of sentiment.

foxbusiness and thefiscaltimes are the only outlets with an overall negative sentiment towards the relief plan, while educationweek and usatodaynews show the strongest positive sentiment towards it. However, being positive or negative in this graph only indicates the overall sentiment of the article. educationweek may contain some very strong criticisms, but most of its text has a positive sentiment about the relief plan, so it scored poorly here.

5.2 Variance of Sentiment

res_dict_GI %>%
  ggplot(aes(fct_reorder(news,dict_compatibility) , y = dict_compatibility, fill = dict_compatibility)) +
  geom_col() +
  theme(legend.position = "none") + 
  theme(axis.text.x = element_text(size=rel(2), angle = 45, vjust = 1, hjust = 1))

This ratio indicates how much variation there was between sentiments (positive vs. negative). In this case, the higher the value the better, as it indicates that the article mentioned topics on opposite sides. cnn scored poorly here because its variation was low. This low variation is also related to how well the article matches the dictionary used: cnn may have used some unusual positive and negative terms that were not captured by the dictionary applied here. To increase the reliability of these results, it would be necessary to compare these articles against other dictionaries.

5.3 Final Rank of News

res_dict_GI %>% 
  ggplot(aes(fct_reorder(news,overall_ratio), overall_ratio, fill = overall_ratio)) +
  geom_col() + 
  coord_flip() +
  theme(legend.position = "none") + 
  theme(axis.text.y = element_text(size=rel(2), angle = 45),
        axis.text.x = element_text(size=rel(2)))

The final rank is a combination of the overall sentiment with the variance of sentiment; in this case, the higher, the better.

thefiscaltimes had the best score in overall sentiment and was third in variance. These results put thefiscaltimes in first place, with a good combination of sentiment balance and high variance of topics.
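
For reference, the same ranking can be read off as a table instead of a plot (a small convenience, not part of the original report):

# News outlets ordered by the combined score, best first
res_dict_GI %>%
  arrange(desc(overall_ratio)) %>%
  select(news, sentiment_only, dict_compatibility, overall_ratio)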

5.4 Directions for Future Improvements

This project’s ultimate goal is to create a ranking of online newspapers, giving the reader a tool to select the outlets that offer the broadest view of political facts.

To enable the online newspaper ranking, I would consider four main tasks.

The first one is writing code to automatically find and download news articles from the internet, based on keywords (a rough sketch appears at the end of this section).

The second one is creating a list of significant political facts and their respective keywords, in order to find the news related to them.

The third one is creating a single dictionary, based on a combination of existing dictionaries, to increase the reliability of this project.

Finally, all the results would be combined to form a reliable ranking of political news outlets.
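
As a rough illustration of the first task, keyword-based retrieval could be sketched with an RSS search feed, for example Google News. This is only a hypothetical sketch: the feed URL pattern and its continued availability are assumptions, and production use would need to respect each site’s terms of service and robots.txt.

library(xml2)

# Hypothetical sketch: fetch headlines and links for a keyword query
# from a Google News RSS search feed (URL pattern is an assumption).
query <- "biden covid relief plan"
feed  <- read_xml(paste0("https://news.google.com/rss/search?q=",
                         URLencode(query)))
titles <- xml_text(xml_find_all(feed, "//item/title"))
links  <- xml_text(xml_find_all(feed, "//item/link"))
data.frame(title = titles, link = links)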