Natural Language Processing in R: Sentiment Analysis of Kenya’s Star Newspaper on Saturday, July 16, 2022
Background
In this analysis, I scrape data from the Star Newspaper for Saturday, July 16, 2022 and evaluate the sentiment and topics that dominate the news.
Data
The data will include all the article titles.
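Throughout, this analysis assumes the following packages are loaded; the setup chunk below is my addition, inferred from the functions used in the post.

library(tidyverse)   # dplyr, stringr, ggplot2, forcats, purrr
library(rvest)       # read_html(), html_nodes(), html_text()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments(), cast_dtm(), tidy()
library(kableExtra)  # kbl(), kable_classic()
library(wordcloud)   # wordcloud()
library(topicmodels) # LDA()

With the packages in place, I start by scraping the data.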
star_titles <- read_html("https://www.the-star.co.ke/") %>%
  html_nodes(".article-card-title") %>%
  html_text() %>%
  tibble() %>%
  set_names("my_titles") %>%
  mutate(id = row_number()) %>%
  relocate(id) %>%
  filter(!str_detect(my_titles, "[Rr]ecipe")) %>%
  mutate(my_titles = str_remove(my_titles, "^.*:"))
We view the first few news stories of the newspaper.
head(star_titles) %>%
  kbl(booktabs = TRUE, caption = "Top News Stories") %>%
  kable_classic(bootstrap_options = "striped")
id | my_titles |
---|---|
1 | What Supreme Court ruling means for Sonko |
2 | Sonko out! Supreme Court upholds impeachment |
3 | What we know about plans to rig elections |
4 | 7 reasons why Supreme Court upheld Sonko’s impeachment |
5 | I’m not suffering from cancer - Paul Muite |
6 | Uhuru’s last minute game |
Next, we examine the most commonly occurring words in the news today. Here, we follow tidy data principles, converting every word into a row of data, as follows.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  head(7) %>%
  kbl(booktabs = TRUE, caption = "Unnesting Words") %>%
  kable_classic(bootstrap_options = "striped")
id | word |
---|---|
1 | what |
1 | supreme |
1 | court |
1 | ruling |
1 | means |
1 | for |
1 | sonko |
However, some words, while useful in writing, do not carry inherent meaning, yet they tend to dominate most of our writing. Let us see the most common words in our text.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 422 × 2
## word n
## <chr> <int>
## 1 in 24
## 2 to 24
## 3 for 12
## 4 and 9
## 5 kenya 9
## 6 of 9
## 7 the 9
## 8 why 7
## 9 with 7
## 10 world 7
## # … with 412 more rows
We see that to, in and for are the most common words. However, they have little meaning on their own. Hence, we shall remove such words, known as stop words, to retain meaningful words. Fortunately, R provides tools to remove stop words.
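For reference, tidytext ships with a built-in stop_words table that combines the SMART, snowball and onix lexicons; a quick peek (exact contents depend on your tidytext version):

# Inspect the bundled stop word list before removing anything
stop_words %>%
  count(lexicon)

head(stop_words, 3)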
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(7)
## # A tibble: 7 × 2
## word n
## <chr> <int>
## 1 kenya 9
## 2 world 7
## 3 2025 5
## 4 raila 5
## 5 tasty 5
## 6 boost 4
## 7 championships 4
Word Frequency
In this section, we examine the words that appear most frequently in the data set. We visualise this data using a column chart and a word cloud. We start with the column chart of the top 10 most common words in today's newspaper titles.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  mutate(word2 = fct_reorder(word, n)) %>%
  ggplot(aes(x = word2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Analysis of Star Newspaper Titles, July 16, 2022",
    subtitle = "Which Words or Names Appear Most Frequently in the News?",
    x = NULL,
    y = NULL
  )
The word cloud gives an even better view of the common words.
word_freq <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word)

wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')
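One practical note: wordcloud() lays the words out with some randomness, so the picture changes between runs. Seeding the RNG first, an optional step I am adding here with an arbitrary seed, makes the cloud reproducible:

set.seed(2022)  # arbitrary seed; any fixed value gives a repeatable layout
wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')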
There appears to be quite a furore over the World Championships (Omanyala never got a visa in time), land (presumably the Kenyatta University land saga), and, understandably, Raila (Odinga) and (William) Ruto.
Getting Average Sentiment
In this section, we estimate the average sentiment or emotional content of words in the Star newspaper. There are several tools that allow for the estimation of sentiment. In R, the common sentiment analysis dictionaries are:
- Bing
- Afinn
- Loughran
- NRC
Please refer to the literature for each of these sentiment measures; a quick way to inspect them is shown below.
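Each of these lexicons is available through tidytext's get_sentiments() function (the AFINN lexicon prompts a one-off download via the textdata package the first time it is used):

# Bing labels words as positive or negative
get_sentiments("bing") %>% head(3)

# AFINN scores words on an integer scale from -5 to 5
get_sentiments("afinn") %>% head(3)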
In this case, we use the nrc dictionary, which has 10 classes of sentiment, listed below. Most English words have been carefully allocated to each of these sentiments. However, there is still room for error. Also, people may use words that do not necessarily correspond with the sentiment, a case in point being sarcasm, although this occurrence is relatively rare in written speech.
get_sentiments("nrc") %>%
count(sentiment) %>%
pull(sentiment)
## [1] "anger" "anticipation" "disgust" "fear" "joy"
## [6] "negative" "positive" "sadness" "surprise" "trust"
We join the nrc dictionary with our data so that each word in our data set has a corresponding sentiment. Overall, the news appears to be positive, although this conclusion could change if we analysed political news alone.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  arrange(desc(n)) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## sentiment n prop
## <chr> <int> <dbl>
## 1 positive 61 0.663
## 2 negative 31 0.337
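As an aside, nothing restricts us to the positive and negative classes; the same join can profile all ten NRC classes, reusing the objects defined above:

# Count every NRC sentiment class, not just positive/negative
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)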
Topic Modelling
Approximately how many topics are covered in the Star Newspaper today? This question is very subjective. However, we can estimate an answer using a technique called Latent Dirichlet Allocation (LDA). Please refer to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation for more details on the LDA technique.
We start by creating a document term matrix:
star_dtm <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(id, word) %>%
  mutate(id = as.character(id)) %>%
  cast_dtm(term = word, document = id, value = n)

glimpse(star_dtm)
## List of 6
## $ i : int [1:462] 1 2 4 1 1 1 2 1 2 4 ...
## $ j : int [1:462] 1 1 1 2 3 4 4 5 5 5 ...
## $ v : num [1:462] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 91
## $ ncol : int 340
## $ dimnames:List of 2
## ..$ Docs : chr [1:91] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:340] "court" "means" "ruling" "sonko" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
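The glimpse shows that the DTM is stored as a simple triplet matrix: only the 462 non-zero entries (i, j, v) are recorded out of the 91 × 340 possible document-term cells. We can confirm the dimensions directly:

dim(star_dtm)  # 91 documents (titles) by 340 terms, per the glimpse above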
We then run an LDA analysis, as follows.
my_star_lda <- LDA(
  star_dtm,
  k = 3,
  method = "Gibbs",
  control = list(seed = 42)
)

glimpse(my_star_lda)
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:464] 2 1 2 2 2 2 2 1 1 2 ...
## ..@ alpha : num 16.7
## ..@ call : language LDA(x = star_dtm, k = 3, method = "Gibbs", control = list(seed = 42))
## ..@ Dim : int [1:2] 91 340
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## ..@ k : int 3
## ..@ terms : chr [1:340] "court" "means" "ruling" "sonko" ...
## ..@ documents : chr [1:91] "1" "2" "3" "4" ...
## ..@ beta : num [1:3, 1:340] -7.47 -4.09 -7.63 -5.08 -7.52 ...
## ..@ gamma : num [1:91, 1:3] 0.321 0.339 0.333 0.327 0.309 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:462] 1 1 1 1 1 2 2 2 2 2 ...
## .. ..$ j : int [1:462] 1 2 3 4 5 1 4 5 6 7 ...
## .. ..$ v : num [1:462] 2 1 2 2 1 2 2 1 1 2 ...
## .. ..$ nrow: int 91
## .. ..$ ncol: int 340
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -2934
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 464
my_star_lda_tidy <- tidy(my_star_lda, matrix = "beta") %>%
  arrange(desc(beta))
my_star_lda_tidy %>%
  group_by(topic) %>%
  top_n(3, beta) %>%
  mutate(term2 = fct_reorder(term, beta)) %>%
  ggplot(aes(x = term2, y = beta, fill = factor(topic))) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
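The beta matrix above holds per-topic word probabilities; the complementary gamma matrix of per-title topic proportions can be tidied the same way (a sketch):

# Per-document (per-title) topic proportions
tidy(my_star_lda, matrix = "gamma") %>%
  arrange(document, desc(gamma)) %>%
  head(6)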
However, deciding the optimal number of topics is subjective. In our case, we appear to have a topic on politics, another on world issues, and yet another dealing with women and general issues. One rough way to compare candidate numbers of topics is sketched below.
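As a heuristic of my own, not part of the original analysis, we could refit the model for several values of k and compare the final Gibbs log-likelihoods, bearing in mind that likelihood alone tends to favour larger k:

# Refit the LDA for a few candidate k values and compare fit
candidate_k <- 2:6
fits <- map(
  candidate_k,
  ~ LDA(star_dtm, k = .x, method = "Gibbs", control = list(seed = 42))
)
tibble(
  k = candidate_k,
  log_lik = map_dbl(fits, ~ .x@loglikelihood)
)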
Conclusion
In this article, we have examined the basics of natural language processing using R. The write-up covered sentiment analysis and the basics of topic modelling. You can find more about these topics on DataCamp or in the book Text Mining with R by Julia Silge and David Robinson.