Natural Language Processing in R: Sentiment Analysis of Kenya’s Star Newspaper on Saturday, July 16, 2022
Background
In this analysis, I scrape data from the Star Newspaper for Saturday, July 16, 2022 and evaluate the sentiment and topics that dominate the news.
Data
The data will include all the article titles. I start by scraping the data.
# Packages used throughout this analysis
library(rvest)        # web scraping
library(tidyverse)    # data wrangling and plotting
library(tidytext)     # tidy text mining tools
library(kableExtra)   # table formatting
library(wordcloud)    # word clouds
library(topicmodels)  # LDA topic modelling

star_titles <- read_html("https://www.the-star.co.ke/") %>%
  html_nodes(".article-card-title") %>%
  html_text() %>%
  tibble() %>%
  set_names("my_titles") %>%
  mutate(id = row_number()) %>%
  relocate(id) %>%
  filter(!str_detect(my_titles, "[Rr]ecipe")) %>%
  mutate(my_titles = str_remove(my_titles, "^.*:"))

We view the first few news stories of the newspaper.
head(star_titles) %>%
  kbl(., booktabs = TRUE, caption = "Top News Stories") %>%
  kable_classic(bootstrap_options = "striped")

| id | my_titles |
|---|---|
| 1 | What Supreme Court ruling means for Sonko |
| 2 | Sonko out! Supreme Court upholds impeachment |
| 3 | What we know about plans to rig elections |
| 4 | 7 reasons why Supreme Court upheld Sonko’s impeachment |
| 5 | I’m not suffering from cancer - Paul Muite |
| 6 | Uhuru’s last minute game |
Next, we examine the most commonly occurring words in the news today. Here, we apply tidy data principles, converting every word into its own row of data, as follows.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  head(7) %>%
  kbl(., booktabs = TRUE, caption = "Unnesting Words") %>%
  kable_classic(bootstrap_options = "striped")

| id | word |
|---|---|
| 1 | what |
| 1 | supreme |
| 1 | court |
| 1 | ruling |
| 1 | means |
| 1 | for |
| 1 | sonko |
However, some words, while useful in writing, carry little inherent meaning, yet they tend to dominate most writing. Let us see the most common words in our text.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  count(word) %>%
  arrange(desc(n))

## # A tibble: 422 × 2
## word n
## <chr> <int>
## 1 in 24
## 2 to 24
## 3 for 12
## 4 and 9
## 5 kenya 9
## 6 of 9
## 7 the 9
## 8 why 7
## 9 with 7
## 10 world 7
## # … with 412 more rows
We see that to, in and for are the most common words. However, they have little meaning on their own. Hence, we shall remove such words to retain only meaningful words. Fortunately, R provides tools to remove these so-called stopwords.
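Before removing them, it helps to see what these tools look like. The snippet below is a quick peek at the stop_words table that ships with the tidytext package; the anti_join() in the pipeline that follows drops every token matching its word column.

# A quick look at tidytext's built-in stopword table
head(stop_words)

# How many stopwords does each lexicon contribute?
stop_words %>%
  count(lexicon)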
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(7)

## # A tibble: 7 × 2
## word n
## <chr> <int>
## 1 kenya 9
## 2 world 7
## 3 2025 5
## 4 raila 5
## 5 tasty 5
## 6 boost 4
## 7 championships 4
Word Frequency
In this section, we examine the words that appear most frequently in the data set. We visualise this data using a column chart and a word cloud. We start with the column chart of the top 10 most common words in today's newspaper titles.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  mutate(word2 = fct_reorder(word, n)) %>%
  ggplot(aes(x = word2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Analysis of Star Newspaper Titles, July 16, 2022",
    subtitle = "Which Words or Names Appear Most Frequently in the News?",
    x = NULL,
    y = NULL
  )

The word cloud gives an even better view of the common words.
word_freq <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word)

wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')

There appears to be quite a furore over the World Championships (Omanyala never got a visa in time), land (presumably the Kenyatta University land saga) and, understandably, Raila (Odinga) and (William) Ruto.
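One practical note on the figure above: wordcloud() places words at random, so the layout changes between runs. A minimal tweak for a reproducible figure is to fix the random seed first, as in this sketch.

# Fixing the seed makes the word-cloud layout reproducible across runs
set.seed(42)
wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')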
Getting Average Sentiment
In this section, we estimate the average sentiment, or emotional content, of the words in the Star newspaper. Several tools allow for the estimation of sentiment. In R, the common sentiment analysis dictionaries are:
- Bing
- AFINN
- Loughran
- NRC
Please refer to the literature for each of these sentiment measures. In this case, we use the nrc dictionary, which has 10 classes of sentiment, listed below. Most English words have been carefully assigned to these sentiments. However, there is still room for error. Also, people may use words that do not necessarily correspond with the sentiment, a case in point being sarcasm, although this occurrence is relatively rare in written speech.
get_sentiments("nrc") %>%
count(sentiment) %>%
pull(sentiment)## [1] "anger" "anticipation" "disgust" "fear" "joy"
## [6] "negative" "positive" "sadness" "surprise" "trust"
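To make the mapping concrete, we can look up individual words in the dictionary. In this sketch, "win" and "court" are arbitrary illustrative words, not ones selected from the scraped headlines.

# Look up the sentiment classes attached to two example words
# ("win" and "court" are arbitrary illustrations)
get_sentiments("nrc") %>%
  filter(word %in% c("win", "court"))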
We join the nrc dictionary with our data so that each word in our data set has a corresponding sentiment. Overall, the news appears to be positive. This conclusion could change if we analysed political news alone.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  arrange(desc(n)) %>%
  mutate(prop = n / sum(n))

## # A tibble: 2 × 3
## sentiment n prop
## <chr> <int> <dbl>
## 1 positive 61 0.663
## 2 negative 31 0.337
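Since the nrc dictionary also tags eight emotion classes beyond the two polarities, a natural extension of the pipeline above is to count every class. The sketch below does exactly that; the counts will of course vary with the day's headlines.

# Count all ten nrc classes, not just positive and negative
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)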
Topic Modelling
Approximately how many topics are covered in the Star Newspaper today? This question is very subjective. However, we can estimate an answer using a technique called Latent Dirichlet Allocation (LDA). Please refer to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation for more details on the LDA technique.
We start by creating a document term matrix:
star_dtm <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(id, word) %>%
  mutate(id = as.character(id)) %>%
  cast_dtm(term = word, document = id, value = n)

glimpse(star_dtm)

## List of 6
## $ i : int [1:462] 1 2 4 1 1 1 2 1 2 4 ...
## $ j : int [1:462] 1 1 1 2 3 4 4 5 5 5 ...
## $ v : num [1:462] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 91
## $ ncol : int 340
## $ dimnames:List of 2
## ..$ Docs : chr [1:91] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:340] "court" "means" "ruling" "sonko" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
We then run an LDA analysis, as follows.
my_star_lda <- LDA(
  star_dtm,
  k = 3,
  method = "Gibbs",
  control = list(seed = 42)
)

glimpse(my_star_lda)

## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:464] 2 1 2 2 2 2 2 1 1 2 ...
## ..@ alpha : num 16.7
## ..@ call : language LDA(x = star_dtm, k = 3, method = "Gibbs", control = list(seed = 42))
## ..@ Dim : int [1:2] 91 340
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## ..@ k : int 3
## ..@ terms : chr [1:340] "court" "means" "ruling" "sonko" ...
## ..@ documents : chr [1:91] "1" "2" "3" "4" ...
## ..@ beta : num [1:3, 1:340] -7.47 -4.09 -7.63 -5.08 -7.52 ...
## ..@ gamma : num [1:91, 1:3] 0.321 0.339 0.333 0.327 0.309 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:462] 1 1 1 1 1 2 2 2 2 2 ...
## .. ..$ j : int [1:462] 1 2 3 4 5 1 4 5 6 7 ...
## .. ..$ v : num [1:462] 2 1 2 2 1 2 2 1 1 2 ...
## .. ..$ nrow: int 91
## .. ..$ ncol: int 340
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -2934
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 464
my_star_lda_tidy <- tidy(my_star_lda, matrix = "beta") %>%
  arrange(desc(beta))

my_star_lda_tidy %>%
  group_by(topic) %>%
  top_n(3, beta) %>%
  mutate(term2 = fct_reorder(term, beta)) %>%
  ggplot(aes(x = term2, y = beta, fill = factor(topic))) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

However, deciding the optimal number of topics is subjective. In our case, we appear to have one topic on politics, another on world issues and yet another dealing with women and general issues.
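Beta gives per-topic word probabilities; the complementary gamma matrix gives per-document topic probabilities. As a minimal sketch, we can assign each headline to its most likely topic and count how the paper splits across the three topics.

# Assign each headline (document) to its highest-probability topic
tidy(my_star_lda, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  count(topic)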
Conclusion
In this article, we have examined the basics of natural language processing using R. The write-up covered sentiment analysis and the basics of topic modelling. You can find related courses on DataCamp, or consult the book Text Mining with R by Julia Silge and David Robinson.