Natural Language Processing in R: Sentiment Analysis of Kenya’s Star Newspaper on Saturday, July 16, 2022
Background
In this analysis, I scrape data from the Star Newspaper for Saturday, July 16, 2022 and evaluate the sentiment and topics that dominate the news.
Data
The data will include all the article titles.
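Throughout, this analysis assumes the following packages are loaded; the setup chunk below is my addition, inferred from the functions used in the post.

library(tidyverse)   # dplyr, stringr, ggplot2, forcats, purrr
library(rvest)       # read_html(), html_nodes(), html_text()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments(), cast_dtm(), tidy()
library(kableExtra)  # kbl(), kable_classic()
library(wordcloud)   # wordcloud()
library(topicmodels) # LDA()

With the packages in place, I start by scraping the data.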
star_titles <- read_html("https://www.the-star.co.ke/") %>%
  html_nodes(".article-card-title") %>%
  html_text() %>%
  tibble() %>%
  set_names("my_titles") %>%
  mutate(id = row_number()) %>%
  relocate(id) %>%
  filter(!str_detect(my_titles, "[Rr]ecipe")) %>%
  mutate(my_titles = str_remove(my_titles, "^.*:"))
We view the first few news stories of the newspaper.
head(star_titles) %>%
  kbl(booktabs = TRUE, caption = "Top News Stories") %>%
  kable_classic(bootstrap_options = "striped")
id | my_titles |
---|---|
1 | What Supreme Court ruling means for Sonko |
2 | Sonko out! Supreme Court upholds impeachment |
3 | What we know about plans to rig elections |
4 | 7 reasons why Supreme Court upheld Sonko’s impeachment |
5 | I’m not suffering from cancer - Paul Muite |
6 | Uhuru’s last minute game |
Next, we examine the most commonly occurring words in the news today. Here, we follow tidy data principles, converting every word into a row of data, as follows.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  head(7) %>%
  kbl(booktabs = TRUE, caption = "Unnesting Words") %>%
  kable_classic(bootstrap_options = "striped")
id | word |
---|---|
1 | what |
1 | supreme |
1 | court |
1 | ruling |
1 | means |
1 | for |
1 | sonko |
However, some words, while useful in writing, do not carry inherent meaning, yet they tend to dominate most of our writing. Let us see the most common words in our text.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 422 × 2
## word n
## <chr> <int>
## 1 in 24
## 2 to 24
## 3 for 12
## 4 and 9
## 5 kenya 9
## 6 of 9
## 7 the 9
## 8 why 7
## 9 with 7
## 10 world 7
## # … with 412 more rows
We see that to, in and for are the most common words. However, they have little meaning on their own. Hence, we shall remove such words, known as stop words, to retain meaningful words. Fortunately, R provides tools to remove stop words.
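For reference, tidytext ships with a built-in stop_words table that combines the SMART, snowball and onix lexicons; a quick peek (exact contents depend on your tidytext version):

# Inspect the bundled stop word list before removing anything
stop_words %>%
  count(lexicon)

head(stop_words, 3)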
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  head(7)
## # A tibble: 7 × 2
## word n
## <chr> <int>
## 1 kenya 9
## 2 world 7
## 3 2025 5
## 4 raila 5
## 5 tasty 5
## 6 boost 4
## 7 championships 4
Word Frequency
In this section, we examine the words that appear most frequently in the data set. We visualise this data using a column chart and a word cloud. We start with the column chart of the top 10 most common words in today's newspaper titles.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  slice(1:10) %>%
  mutate(word2 = fct_reorder(word, n)) %>%
  ggplot(aes(x = word2, y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Analysis of Star Newspaper Titles, July 16, 2022",
    subtitle = "Which Words or Names Appear Most Frequently in the News?",
    x = NULL,
    y = NULL
  )
The word cloud gives an even better view of the common words.
word_freq <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(word)

wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')
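One practical note: wordcloud() lays the words out with some randomness, so the picture changes between runs. Seeding the RNG first, an optional step I am adding here with an arbitrary seed, makes the cloud reproducible:

set.seed(2022)  # arbitrary seed; any fixed value gives a repeatable layout
wordcloud(words = word_freq$word, freq = word_freq$n,
          min.freq = 3, col = 'red')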
There appears to be quite a furore over the World Championships (Omanyala never got a visa in time), land (presumably the Kenyatta University land saga), and, understandably, Raila (Odinga) and (William) Ruto.
Getting Average Sentiment
In this section, we estimate the average sentiment or emotional content of words in the Star newspaper. There are several tools that allow for the estimation of sentiment. In R, the common sentiment analysis dictionaries are:
- Bing
- Afinn
- Loughran
- NRC
Please refer to the literature for each of these sentiment measures; a quick way to inspect them is shown below.
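Each of these lexicons is available through tidytext's get_sentiments() function (the AFINN lexicon prompts a one-off download via the textdata package the first time it is used):

# Bing labels words as positive or negative
get_sentiments("bing") %>% head(3)

# AFINN scores words on an integer scale from -5 to 5
get_sentiments("afinn") %>% head(3)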
In this case, we use the nrc dictionary, which has 10 classes of sentiment, listed below. Most English words have been carefully allocated to each of these sentiments. However, there is still room for error. Also, people may use words that do not necessarily correspond with the sentiment, a case in point being sarcasm, although this occurrence is relatively rare in written speech.
get_sentiments("nrc") %>%
count(sentiment) %>%
pull(sentiment)
## [1] "anger" "anticipation" "disgust" "fear" "joy"
## [6] "negative" "positive" "sadness" "surprise" "trust"
We join the nrc dictionary with our data so that each word in our data set has a corresponding sentiment. Overall, the news appears to be positive, although this conclusion could change if we analysed political news alone.
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment) %>%
  arrange(desc(n)) %>%
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
## sentiment n prop
## <chr> <int> <dbl>
## 1 positive 61 0.663
## 2 negative 31 0.337
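As an aside, nothing restricts us to the positive and negative classes; the same join can profile all ten NRC classes, reusing the objects defined above:

# Count every NRC sentiment class, not just positive/negative
star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  inner_join(get_sentiments("nrc")) %>%
  count(sentiment, sort = TRUE)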
Topic Modelling
Approximately how many topics are covered in the Star Newspaper today? This question is very subjective. However, we can estimate an answer using a technique called Latent Dirichlet Allocation (LDA). Please refer to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation for more details on the LDA technique.
We start by creating a document term matrix:
star_dtm <- star_titles %>%
  unnest_tokens(word, my_titles) %>%
  anti_join(stop_words) %>%
  count(id, word) %>%
  mutate(id = as.character(id)) %>%
  cast_dtm(term = word, document = id, value = n)

glimpse(star_dtm)
## List of 6
## $ i : int [1:462] 1 2 4 1 1 1 2 1 2 4 ...
## $ j : int [1:462] 1 1 1 2 3 4 4 5 5 5 ...
## $ v : num [1:462] 1 1 1 1 1 1 1 1 1 1 ...
## $ nrow : int 91
## $ ncol : int 340
## $ dimnames:List of 2
## ..$ Docs : chr [1:91] "1" "2" "3" "4" ...
## ..$ Terms: chr [1:340] "court" "means" "ruling" "sonko" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
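The glimpse shows that the DTM is stored as a simple triplet matrix: only the 462 non-zero entries (i, j, v) are recorded out of the 91 × 340 possible document-term cells. We can confirm the dimensions directly:

dim(star_dtm)  # 91 documents (titles) by 340 terms, per the glimpse above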
We then run an LDA analysis, as follows.
my_star_lda <- LDA(
  star_dtm,
  k = 3,
  method = "Gibbs",
  control = list(seed = 42)
)

glimpse(my_star_lda)
## Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
## ..@ seedwords : NULL
## ..@ z : int [1:464] 2 1 2 2 2 2 2 1 1 2 ...
## ..@ alpha : num 16.7
## ..@ call : language LDA(x = star_dtm, k = 3, method = "Gibbs", control = list(seed = 42))
## ..@ Dim : int [1:2] 91 340
## ..@ control :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] with 14 slots
## ..@ k : int 3
## ..@ terms : chr [1:340] "court" "means" "ruling" "sonko" ...
## ..@ documents : chr [1:91] "1" "2" "3" "4" ...
## ..@ beta : num [1:3, 1:340] -7.47 -4.09 -7.63 -5.08 -7.52 ...
## ..@ gamma : num [1:91, 1:3] 0.321 0.339 0.333 0.327 0.309 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:462] 1 1 1 1 1 2 2 2 2 2 ...
## .. ..$ j : int [1:462] 1 2 3 4 5 1 4 5 6 7 ...
## .. ..$ v : num [1:462] 2 1 2 2 1 2 2 1 1 2 ...
## .. ..$ nrow: int 91
## .. ..$ ncol: int 340
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num -2934
## ..@ iter : int 2000
## ..@ logLiks : num(0)
## ..@ n : int 464
my_star_lda_tidy <- tidy(my_star_lda, matrix = "beta") %>%
  arrange(desc(beta))
my_star_lda_tidy %>%
  group_by(topic) %>%
  top_n(3, beta) %>%
  mutate(term2 = fct_reorder(term, beta)) %>%
  ggplot(aes(x = term2, y = beta, fill = factor(topic))) +
  geom_col() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()
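The beta matrix above holds per-topic word probabilities; the complementary gamma matrix of per-title topic proportions can be tidied the same way (a sketch):

# Per-document (per-title) topic proportions
tidy(my_star_lda, matrix = "gamma") %>%
  arrange(document, desc(gamma)) %>%
  head(6)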
However, deciding the optimal number of topics is subjective. In our case, we appear to have a topic on politics, another on world issues, and yet another dealing with women and general issues. One rough way to compare candidate numbers of topics is sketched below.
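As a heuristic of my own, not part of the original analysis, we could refit the model for several values of k and compare the final Gibbs log-likelihoods, bearing in mind that likelihood alone tends to favour larger k:

# Refit the LDA for a few candidate k values and compare fit
candidate_k <- 2:6
fits <- map(
  candidate_k,
  ~ LDA(star_dtm, k = .x, method = "Gibbs", control = list(seed = 42))
)
tibble(
  k = candidate_k,
  log_lik = map_dbl(fits, ~ .x@loglikelihood)
)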
Conclusion
In this article, we have examined the basics of natural language processing using R. The write-up covered sentiment analysis and the basics of topic modelling. You can find more about these topics on DataCamp or in the book Text Mining with R by Julia Silge and David Robinson.