This data set is from https://businessday.ng/news/legal-business/article/full-text-of-inaugural-speech-by-president-tinubu/. The first step in analysing the entire address corpus is to load all the libraries needed for the analysis, followed by loading the data set from its file path. Along the way, we get a glimpse of what the focus of this new administration, whose campaign theme was based on hope, is likely to be.
Load the required packages.
#install.packages('tidyverse') #only install packages once
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.1 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages('tidytext') #only install packages once
library(tidytext)
#install.packages('SnowballC') #only install packages once
library(SnowballC)
#install.packages('wordcloud') #only install packages once
library(wordcloud)
## Loading required package: RColorBrewer
#install.packages('Rcpp') #actually, this package may need to be updated
library(Rcpp)
file_path <- r"(C:\Users\ebene\Desktop\text analysis\PresidentTinubu.txt)"
file_size <- file.info(file_path)$size
tinu <- readChar(file_path, file_size)
# Number of elements: the whole speech is read in as a single character string
length(tinu)
## [1] 1
# explore
summary(tinu)
## Length Class Mode
## 1 character character
Let's extract a substring from the tinu data, from the 1st to the 400th character. The commented-out second line shows the same call with the stringr package specified explicitly.
# view
tinu %>% str_sub(1, 400)
## [1] "My Fellow Citizens,\r\n\r\nI stand before you honoured to assume the sacred mandate you have given me. My love for this nation is abiding. My confidence in its people, unwavering. And my faith in God Almighty, absolute. I know that His hand shall provide the needed moral strength and clarity of purpose in those instances when we seem to have reached the limits of our human capacity.\r\n\r\nThis day is bol"
# tinu %>% stringr::str_sub(1, 400)
The ‘\r\n’ in the text is a special sequence of characters used to indicate the end of a line and the start of a new line in many programming languages and text formats.
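If those line-ending markers get in the way of later steps, here is a minimal sketch of stripping them with stringr (tinu_clean is a new name introduced here purely for illustration):
# Replace Windows line endings with spaces (sketch; not required for the steps below)
tinu_clean <- tinu %>% stringr::str_replace_all("\r\n", " ")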
When the output is TRUE it tells us the keyword is present; when it is FALSE, the keyword is missing. The second chunk counts the number of matches, the third chunk tells us where the keywords are mentioned, and the last chunk shows the surrounding text around the keywords.
# detect keywords
tinu %>% stringr::str_detect(c('Democracy', 'democracy', 'Security', 'security', 'Nation', 'nation', 'Unity', 'unity', 'Diversity', 'diversity','Agriculture', 'agriculture','Terrorism', 'terrorism', 'Job', 'job', 'Education', 'education'))
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
## [13] FALSE FALSE TRUE TRUE FALSE TRUE
# count the number of matches of a substring
#Security, Diversity, Job, Agriculture and Unity all play a crucial role in the development and existence of a nation
tinu %>% stringr::str_count("Security")
## [1] 1
tinu %>% stringr::str_count("security")
## [1] 4
tinu %>% stringr::str_count("Diversity")
## [1] 0
tinu %>% stringr::str_count("diversity")
## [1] 1
tinu %>% stringr::str_count("Job")
## [1] 1
tinu %>% stringr::str_count("job")
## [1] 4
tinu %>% stringr::str_count("Education")
## [1] 0
tinu %>% stringr::str_count("education")
## [1] 1
Summing the capitalised and lower-case forms of the keywords: security appears 5 times, diversity 1, job 5 and education 1.
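Instead of counting the capitalised and lower-case spellings separately and adding them up, one could make the count case-insensitive in a single call; a sketch using stringr's regex() helper, which should reproduce the summed totals above:
# Case-insensitive keyword counts (sketch)
tinu %>% stringr::str_count(stringr::regex("security", ignore_case = TRUE))
tinu %>% stringr::str_count(stringr::regex("job", ignore_case = TRUE))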
# Where is this keyword mentioned?
tinu %>% stringr::str_locate_all('job')
## [[1]]
## start end
## [1,] 6547 6549
## [2,] 8396 8398
## [3,] 9987 9989
## [4,] 10270 10272
# View surrounding text (requires regex)
tinu %>% stringr::str_extract_all(".{50}(job).{50}")
## [[1]]
## [1] "omy to bring about growth and development through job creation, food security and an end of extreme pov"
## [2] "public infrastructure, education, health care and jobs that will materially improve the lives of millio"
The {50} quantifier specifies that the preceding pattern must be repeated exactly 50 times. The dot ‘.’ matches any character, so .{50} matches any 50 characters. (job) is a capturing group that matches the specific sequence “job”, and the final .{50} again matches any 50 characters. The overall regular expression .{50}(job).{50} therefore matches any sequence of 50 characters before and after the word “job”.
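The 50-character window is arbitrary; widening or narrowing the context is just a matter of changing the quantifier. A sketch with 30 characters of context around ‘security’:
# 30 characters of context on either side of 'security' (sketch)
tinu %>% stringr::str_extract_all(".{30}(security).{30}")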
# Count characters
tinu %>% str_count()
## [1] 12026
#tinu %>% stringr::str_count()
This is the number of characters in the tinu string: str_count() with no pattern counts characters, not words.
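If a word count is wanted instead, stringr's boundary() helper can be passed as the pattern; a minimal sketch, whose result should be close to the token count obtained below:
# Count words instead of characters (sketch)
tinu %>% stringr::str_count(stringr::boundary("word"))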
# Change to a tibble (tidy dataframe)
tokens_speech <- tibble(tinu)
# Tokenize
tokens_speech <- tokens_speech %>% tidytext::unnest_tokens(output=word, input=tinu, token='words', to_lower=TRUE)
Tokenizing splits the corpus into words (unnest_tokens() uses the tokenizers package under the hood). The result is a tibble with 1,993 rows, each representing a single word.
#add order of the words
tokens_speech <- tokens_speech %>% mutate(order = row_number())
# Count tokens
tokens_speech %>% nrow()
## [1] 1993
Let's view the first few words.
# First few words
tokens_speech[1:30, ]
## # A tibble: 30 × 2
## word order
## <chr> <int>
## 1 my 1
## 2 fellow 2
## 3 citizens 3
## 4 i 4
## 5 stand 5
## 6 before 6
## 7 you 7
## 8 honoured 8
## 9 to 9
## 10 assume 10
## # ℹ 20 more rows
# count the number of matches of a substring
tokens_speech %>% dplyr::filter(word == 'security') %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 5
# Where is this keyword mentioned?
tokens_speech %>% dplyr::filter(word == 'security')
## # A tibble: 5 × 2
## word order
## <chr> <int>
## 1 security 1135
## 2 security 1182
## 3 security 1183
## 4 security 1213
## 5 security 1224
# Look at the most important frequent words
tokens_speech %>%
group_by(word) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
filter(count > 10) %>%
mutate(tokens_speech = reorder(word, count)) %>%
ggplot(aes(x=count, y=tokens_speech)) +
geom_col()
The most frequent words are ‘the’, ‘and’, ‘to’ and ‘of’. They don't carry much meaning in this analysis; they are called stop words in English. Let's filter out the stop words using the anti_join() function.
# Remove stop words
library(tidytext)
tokens_speech <- tokens_speech %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
tokens_speech %>% nrow()
## [1] 850
After removing stop words, the remaining words are plotted against their frequency of occurrence, with the cutoff reduced to 4. It will interest you to note that nation, Nigeria and policy top the chart, with frequencies of roughly 9 to 14; security and job sit in the middle with frequencies around 5; investment is at the base of the chart with a frequency of 4; and education did not make it.
tokens_speech %>%
group_by(word) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
filter(count >=4) %>%
mutate(token = reorder(word, count)) %>%
ggplot(aes(x=count, y=token)) +
geom_col()
Stemming and lemmatizing are techniques used in natural language processing (NLP) to reduce words to their base or root form. Stemming removes prefixes, suffixes and other affixes from words to obtain the stem, or base word.
Lemmatizing, on the other hand, aims to determine the lemma, the dictionary form of a word, and takes the context and grammar of the word into account.
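Only stemming is used in the rest of this walkthrough. For completeness, here is a hedged sketch of what lemmatization could look like, assuming the textstem package, which is not installed or loaded above:
#install.packages('textstem') #assumption: not among the packages loaded above
lemmed <- tokens_speech %>% mutate(lemma = textstem::lemmatize_words(word))
lemmed[1:10, ]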
The first chunk arranges the tokens_speech data frame by the word column and displays rows 315 to 325. The second chunk creates a new column, stem, containing the stemmed version of each word in the word column, using SnowballC::wordStem().
# look at similar words
arrange(tokens_speech, word)[315:325, ]
## # A tibble: 11 × 2
## word order
## <chr> <int>
## 1 funds 1621
## 2 funds 1661
## 3 furtherance 1065
## 4 future 211
## 5 future 220
## 6 future 1911
## 7 gdp 1253
## 8 generation 1315
## 9 god 35
## 10 god 264
## 11 god 1984
#Stem the tokens
stemmed <- tokens_speech %>% mutate(stem = SnowballC::wordStem(word))
# look at similar words now
arrange(stemmed, word)[315:325, ]
## # A tibble: 11 × 3
## word order stem
## <chr> <int> <chr>
## 1 funds 1621 fund
## 2 funds 1661 fund
## 3 furtherance 1065 further
## 4 future 211 futur
## 5 future 220 futur
## 6 future 1911 futur
## 7 gdp 1253 gdp
## 8 generation 1315 gener
## 9 god 35 god
## 10 god 264 god
## 11 god 1984 god
stemmed %>%
group_by(stem) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
filter(count > 4) %>%
mutate(token = reorder(stem, count)) %>%
ggplot(aes(x=count, y=token)) +
geom_col()
After stemming the words, frequencies increased across words: nation moved from about 19 to 20, and jobs from 5 to 6.
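One quick way to see why the stemmed frequencies are higher is to compare the raw word counts with the stem counts side by side; a sketch:
# Compare top counts before and after stemming (sketch)
stemmed %>% count(word, sort = TRUE) %>% head(10)
stemmed %>% count(stem, sort = TRUE) %>% head(10)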
set.seed(200)
stemmed %>%
group_by(word) %>%
summarize(count = n()) %>%
with(wordcloud(words=word, freq=count, min.freq=3, max.words=100, random.order=F, rot.per=0.30, colors=brewer.pal(8, "Dark2")))
Below is a brief description of the arguments used in the wordcloud() function:
words – the words to be plotted
freq – the frequencies of the words
min.freq – words whose frequency is at or above this threshold are plotted (in this case, I have set it to 3)
max.words – the maximum number of words to display on the plot (in the code above, I have set it to 100)
random.order – I have set it to FALSE, so the words are plotted in order of decreasing frequency
rot.per – the percentage of words displayed as vertical text (with 90-degree rotation); I have set it to 0.30 (30 %), please feel free to adjust this setting to suit your preferences
colors – changes word colours, going from the lowest to the highest frequencies
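As an example of adjusting these settings, here is a sketch of a variant cloud with a stricter frequency threshold, fewer words and no rotated words; the stemmed counts are the same as above:
# Variant word cloud: stricter threshold, no vertical words (sketch)
set.seed(200)
stemmed %>%
group_by(word) %>%
summarize(count = n()) %>%
with(wordcloud(words=word, freq=count, min.freq=5, max.words=50, random.order=F, rot.per=0, colors=brewer.pal(8, "Dark2")))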
# sentiment dictionary from tidytext, used for sentiment analysis
library(tidytext)
lm_dict <- get_sentiments("nrc")
lm_dict %>% group_by(sentiment) %>% summarize(count = n())
## # A tibble: 10 × 2
## sentiment count
## <chr> <int>
## 1 anger 1245
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3316
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
The “NRC Emotion Lexicon”, developed by Saif M. Mohammad and Peter D. Turney, is one popular sentiment lexicon for political analysis. It includes sentiment annotations for various emotions and is commonly used in political sentiment analysis studies. You can obtain the “NRC Emotion Lexicon” using the get_sentiments() function from the tidytext package.
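To see how the lexicon is structured, one can look up the sentiment categories, if any, attached to a single word; a quick sketch:
# Which sentiment categories, if any, is a given word tagged with? (sketch)
lm_dict %>% dplyr::filter(word == "hope")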
# Add sentiment
sentimented <- stemmed %>%
inner_join(lm_dict, by = 'word')
## Warning in inner_join(., lm_dict, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 8491 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# Explore totals
sentimented %>%
group_by(sentiment) %>%
summarize(count = n(), percent = count/nrow(sentimented))
## # A tibble: 10 × 3
## sentiment count percent
## <chr> <int> <dbl>
## 1 anger 23 0.0387
## 2 anticipation 58 0.0976
## 3 disgust 14 0.0236
## 4 fear 43 0.0724
## 5 joy 63 0.106
## 6 negative 52 0.0875
## 7 positive 177 0.298
## 8 sadness 13 0.0219
## 9 surprise 15 0.0253
## 10 trust 136 0.229
sentimented %>%
group_by(sentiment) %>%
summarize(count = n(), percent = count/nrow(sentimented)) %>%
ggplot(aes(x='', y=percent, fill=sentiment)) +
geom_bar(width=1, stat='identity')
Based on the sentiment scores provided for the political address, we can observe the following sentiments and their corresponding counts and percentages:
Anger: 23 occurrences, accounting for approximately 3.87% of the sentiment scores.
Anticipation: 58 occurrences, approximately 9.76%.
Disgust: 14 occurrences, approximately 2.36%.
Fear: 43 occurrences, approximately 7.24%.
Joy: 63 occurrences, approximately 10.61%.
Negative: 52 occurrences, approximately 8.75%.
Positive: 177 occurrences, approximately 29.80%.
Sadness: 13 occurrences, approximately 2.19%.
Surprise: 15 occurrences, approximately 2.53%.
Trust: 136 occurrences, approximately 22.90%.
These sentiment scores provide an indication of the distribution of different sentiments expressed in the political address. From the provided data, it seems that the dominant sentiments are positive and trust, as they have the highest counts and percentages. However, it’s important to consider the context and content of the political address to have a comprehensive understanding of the sentiment expressed.
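One quick way to ground these totals in the actual content of the address is to look at which words contribute most to the dominant categories; a sketch:
# Words contributing most to the 'positive' and 'trust' categories (sketch)
sentimented %>%
dplyr::filter(sentiment %in% c('positive', 'trust')) %>%
count(word, sort = TRUE) %>%
head(10)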