Objective: In this study, we will implement methods such as Natural Language Processing (NLP) to perform text analytics and visualization. Most of the time, the data is available in an unstructured format, and to derive insights from it, preprocessing is required. We will use NLP algorithms to read the text and extract meaningful information. Specifically, this study analyzes seven poems and generates graphical visualizations, including word distributions, word clouds, and more, using various R libraries.
In the working session, we will load the packages we need:
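The chunk that loads the packages is not shown here. A minimal sketch follows, assuming the analysis relies on dplyr, tidytext, ggplot2, wordcloud, and tm. Only tidytext and tm are named in the text; the rest of the list is an assumption.

```r
# Packages assumed for this analysis (only tidytext and tm are named in the
# text; the others are an assumption).
pkgs <- c("dplyr", "tidytext", "ggplot2", "wordcloud", "tm")

# Check which packages are installed without stopping on a missing one.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed packages and report the missing ones.
for (p in pkgs[installed]) library(p, character.only = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```

Missing packages can be added with `install.packages()` before rerunning the chunk.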
We will create our dataset by adding, one by one, seven English-language poems by various authors. The poems are available on the Poetry Foundation website. The resulting dataset will then be analyzed.
The full poem text is available here (Still I Rise by Maya Angelou)
The first few lines of the data are presented:
## [1] "You may write me down in history" "With your bitter, twisted lies,"
## [3] "You may trod me in the very dirt" "But still, like dust, I'll rise."
## [5] "Does my sassiness upset you?" "Why are you beset with gloom?"
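The loading code itself is not shown. As a rough sketch, a poem pasted in as a single string could be split into verses and stored together with a verse number. The object names `poem_text`, `verses`, and `still_i_rise`, as well as the column names, are illustrative assumptions rather than the original ones.

```r
# First four verses of "Still I Rise", as printed in the output above.
poem_text <- "You may write me down in history
With your bitter, twisted lies,
You may trod me in the very dirt
But still, like dust, I'll rise."

# Split the text into verses (one element per line).
verses <- strsplit(poem_text, "\n", fixed = TRUE)[[1]]

# Hypothetical table layout: verse number and verse text.
still_i_rise <- data.frame(
  line = seq_along(verses),
  text = verses,
  stringsAsFactors = FALSE
)

# Show the first few verses.
head(still_i_rise$text)
```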
The full poem text is available here (Kubla Khan By Samuel Taylor Coleridge)
The first few lines of the data are presented:
## [1] "In Xanadu did Kubla Khan"
## [2] "A stately pleasure-dome decree:"
## [3] "Where Alph, the sacred river, ran"
## [4] "Through caverns measureless to man"
## [5] "Down to a sunless sea."
## [6] "So twice five miles of fertile ground"
The full poem text is available here (Song of Myself (1892 version) By Walt Whitman)
The first few lines of the data are presented:
## [1] "Turning and turning in the widening gyre"
## [2] "The falcon cannot hear the falconer;"
## [3] "Things fall apart; the centre cannot hold;"
## [4] "Mere anarchy is loosed upon the world,"
## [5] "The blood-dimmed tide is loosed, and everywhere"
## [6] "The ceremony of innocence is drowned;"
The full poem text is available here (Because I could not stop for Death – (479) By Emily Dickinson)
The first few lines of the data are presented:
## [1] "Because I could not stop for Death –"
## [2] "He kindly stopped for me –"
## [3] "The Carriage held but just Ourselves –"
## [4] "And Immortality."
## [5] "We slowly drove – He knew no haste"
## [6] "And I had put away"
The full poem text is available here (Ode to a Nightingale by John Keats)
The first few lines of the data are presented:
## [1] "My heart aches, and a drowsy numbness pains"
## [2] "My sense, as though of hemlock I had drunk,"
## [3] "Or emptied some dull opiate to the drains"
## [4] "One minute past, and Lethe-wards had sunk:"
## [5] "'Tis not through envy of thy happy lot,"
## [6] "But being too happy in thine happiness, —"
The full poem text is available here (The Rime of the Ancient Mariner by Samuel Taylor Coleridge)
The first few lines of the data are presented:
## [1] " It is an ancient Mariner,"
## [2] "And he stoppeth one of three."
## [3] "By thy long grey beard and glittering eye,"
## [4] "Now wherefore stopp'st thou me?"
## [5] "The Bridegroom's doors are opened wide,"
## [6] "And I am next of kin;"
The full poem text is available here (Ode to the West Wind by Percy Bysshe Shelley)
The first few lines of the data are presented:
## [1] "O wild West Wind, thou breath of Autumn's being,"
## [2] "Thou, from whose unseen presence the leaves dead"
## [3] "Are driven, like ghosts from an enchanter fleeing,"
## [4] "Yellow, and black, and pale, and hectic red,"
## [5] "Pestilence-stricken multitudes: O thou,"
## [6] "Who chariotest to their dark wintry bed"
We will add a column with the title of the poem.
The next step is to combine all created tables into one data frame using the rbind() function.
The resulting dataset (poezii) consists of seven English poems, in which each row provides:
the number of the verse within the poem,
the text of the verse,
the title of the poem to which the verse belongs.
The seven poem titles from our created dataset are:
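The steps described above (adding a title column, stacking the tables with rbind(), and listing the titles) might look roughly like the following. Only `poezii` and `rbind()` come from the text; the per-poem object names and the column names `line`, `text`, and `title` are assumptions, and only two poems with two verses each are used to keep the sketch short.

```r
# Two small per-poem tables (verse number and verse text); names are assumed.
still_i_rise <- data.frame(
  line = 1:2,
  text = c("You may write me down in history",
           "With your bitter, twisted lies,"),
  stringsAsFactors = FALSE
)
kubla_khan <- data.frame(
  line = 1:2,
  text = c("In Xanadu did Kubla Khan",
           "A stately pleasure-dome decree:"),
  stringsAsFactors = FALSE
)

# Add a column with the title of the poem.
still_i_rise$title <- "Still I Rise"
kubla_khan$title <- "Kubla Khan"

# Stack the tables into a single data frame.
poezii <- rbind(still_i_rise, kubla_khan)

# List the distinct poem titles.
unique(poezii$title)
```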
We will transform our poem dataset into a structured format that is easier to analyze and parse at the level of individual words. The unnest_tokens() function from the tidytext package breaks the verses into individual words.
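A minimal sketch of the tokenization step, assuming the verse text is stored in a column named `text` (the column names and the result name `tidy_poezii` are assumptions). By default, unnest_tokens() lowercases the words and strips punctuation.

```r
library(dplyr)
library(tidytext)

# Toy input: two verses of one poem (column names are assumed).
poezii <- tibble(
  line  = 1:2,
  text  = c("You may write me down in history",
            "With your bitter, twisted lies,"),
  title = "Still I Rise"
)

# One row per word; the new column is called `word`.
tidy_poezii <- poezii %>%
  unnest_tokens(word, text)

tidy_poezii
```

The other columns (`line`, `title`) are kept, so each word can still be traced back to its verse and poem.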
All histograms show right-skewed distributions, consistent with Zipf’s law. This means that a small number of frequent words (such as stopwords or common words) dominate, while many words appear only rarely. These less frequent words are often part of the author’s unique vocabulary and include thematic words related to nature, emotions, symbolic imagery, and similar concepts. It is important to note that stopwords were not removed from the analyzed dataset.
As we can see, the most frequently used word is the, followed by and, of, and so on.
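Word counts of this kind can be obtained with dplyr's count(). The sketch below uses the words of the first five verses of Kubla Khan, typed in by hand; the table name `tidy_poezii` and its columns are assumptions. Even in this tiny sample most words occur once and only a few occur more often, which is the right-skewed shape described above.

```r
library(dplyr)

# Words of the first five verses of "Kubla Khan", one row per word.
tidy_poezii <- tibble(
  title = "Kubla Khan",
  word  = c("in", "xanadu", "did", "kubla", "khan",
            "a", "stately", "pleasure", "dome", "decree",
            "where", "alph", "the", "sacred", "river", "ran",
            "through", "caverns", "measureless", "to", "man",
            "down", "to", "a", "sunless", "sea")
)

# Count how often each word occurs in each poem, most frequent first.
word_counts <- tidy_poezii %>%
  count(title, word, sort = TRUE)

# How many words occur once, twice, and so on.
table(word_counts$n)
```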
In the next step, the stopwords will be removed from the analyzed dataset, retaining only the relevant words.
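One common way to do this in the tidytext workflow is an anti-join against the `stop_words` table shipped with the package. Whether the original analysis used exactly this lexicon is an assumption, and the object names are illustrative.

```r
library(dplyr)
library(tidytext)

# Words of the verse "But still, like dust, I'll rise." (one row per word).
tidy_poezii <- tibble(
  title = "Still I Rise",
  word  = c("but", "still", "like", "dust", "i'll", "rise")
)

# Stop word lexicons bundled with tidytext.
data(stop_words)

# Keep only the words that do not appear in the stop word list.
tidy_clean <- tidy_poezii %>%
  anti_join(stop_words, by = "word")

tidy_clean
```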
It can be observed that all histograms display right-skewed distributions, consistent with Zipf’s law. This means that high-frequency words, also known as common words, dominate and appear at the beginning of the distribution, while low-frequency words occur rarely and appear in the tail of the distribution.
Graphical representation of the distribution of words with an occurrence frequency greater than 4.
The poem with the largest number of words is Ode to a Nightingale (297 words), followed by Ode to the West Wind (278 words), Rime of the Ancient Mariner (226 words), and so on, as shown in the table below. At the opposite end is Because I could not stop for Death, with 53 words.
For comparison, Ode to a Nightingale contains 297 words and Kubla Khan contains 185 words.
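Totals like these can be computed by counting the rows of the word table per title. The sketch below uses a tiny toy table, so its totals are not the ones reported above; `tidy_clean` and `n_words` are assumed names.

```r
library(dplyr)

# Toy word table (one row per word); the words are illustrative only.
tidy_clean <- tibble(
  title = c(rep("Ode to a Nightingale", 5), rep("Kubla Khan", 3)),
  word  = c("heart", "aches", "drowsy", "numbness", "pains",
            "xanadu", "kubla", "khan")
)

# Number of words per poem, largest first.
words_per_poem <- tidy_clean %>%
  count(title, name = "n_words") %>%
  arrange(desc(n_words))

words_per_poem
```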
The figure below presents the distribution of words with an occurrence frequency greater than 2 in six different poems: Because I could not stop for Death, Kubla Khan, Ode to a Nightingale, Still I Rise, etc.
Each panel shows the most used words (keywords) in one poem and their occurrence frequency.
In Because I could not stop for Death, the word passed appears 4 times.
In Kubla Khan, words like dome and sunny appear 5 times, while words like sacred, river, pleasure, heard, and caves appear 3 times.
In Ode to a Nightingale, the most frequent words are thou and thee, which appear 5 times each, while thy appears 4 times.
In Still I Rise, the word rise appears 10 times and is, in this case, the theme of the poem.
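A faceted bar chart of this kind can be sketched with ggplot2. The counts below are the ones quoted in the text above; the data frame name `keywords` and the plotting choices (horizontal bars, one panel per poem) are assumptions, not necessarily those of the original figure.

```r
library(ggplot2)

# Keyword counts as quoted in the text (a subset of the full figure).
keywords <- data.frame(
  title = c("Because I could not stop for Death",
            "Kubla Khan", "Kubla Khan", "Kubla Khan",
            "Ode to a Nightingale", "Ode to a Nightingale",
            "Ode to a Nightingale",
            "Still I Rise"),
  word  = c("passed", "dome", "sunny", "caves",
            "thou", "thee", "thy", "rise"),
  n     = c(4, 5, 5, 3, 5, 5, 4, 10),
  stringsAsFactors = FALSE
)

# One panel per poem, words ordered by frequency on the vertical axis.
p <- ggplot(keywords, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  facet_wrap(~ title, scales = "free_y") +
  labs(x = "Occurrences", y = NULL)

print(p)
```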
To create a word cloud, we will first compute the term-document matrix (TDM), in which each poem title represents a document. For this purpose, we will use the tm or tidytext package in R.
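To illustrate the structure of such a matrix without depending on tm, the sketch below builds a small one with base R's xtabs(); this is a stand-in for the tm or tidytext route (for example tidytext's cast_tdm()), not the original code. The five counts are taken from the printed matrix; the object names are assumptions.

```r
# Word counts per poem, taken from the printed matrix.
word_counts <- data.frame(
  title = c("Kubla_Khan", "Kubla_Khan",
            "Ode_to_a_Nightingale", "Ode_to_a_Nightingale",
            "Rime_of_the_Ancient_Mariner"),
  word  = c("abora", "caves", "adieu", "aches", "albatross"),
  n     = c(1, 3, 3, 1, 3),
  stringsAsFactors = FALSE
)

# Terms in rows, documents in columns; missing combinations become 0.
tdm <- xtabs(n ~ word + title, data = word_counts)

# Transposed view: documents in rows, terms in columns.
t(tdm)

# Total frequency of each term across all poems (input for a word cloud).
term_freq <- sort(rowSums(tdm), decreasing = TRUE)
term_freq
```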
## Terms
## Docs abora abroad abyssinian aches adieu age
## Kubla_Khan 1 0 1 0 0 0
## Ode_to_a_Nightingale 0 1 0 1 3 1
## Ode_to_the_West_Wind 0 0 0 0 0 0
## Still_I_Rise 0 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 0 0
## Song_of_Myself 0 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs air albatross alien alph amid anarchy
## Kubla_Khan 1 0 0 1 1 0
## Ode_to_a_Nightingale 1 0 1 0 1 0
## Ode_to_the_West_Wind 1 0 0 0 0 0
## Still_I_Rise 1 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 3 0 0 0 0
## Song_of_Myself 0 0 0 0 0 1
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs ancestors ancestral ancient angels anthem
## Kubla_Khan 0 1 1 0 0
## Ode_to_a_Nightingale 0 0 1 0 1
## Ode_to_the_West_Wind 0 0 0 1 0
## Still_I_Rise 1 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 4 0 0
## Song_of_Myself 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0
## Terms
## Docs approaching art ashes ate athwart
## Kubla_Khan 0 0 0 0 1
## Ode_to_a_Nightingale 0 1 0 0 0
## Ode_to_the_West_Wind 1 1 1 0 0
## Still_I_Rise 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 1 0
## Song_of_Myself 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0
## Terms
## Docs atlantic's atmosphere autumn's autumnal
## Kubla_Khan 0 0 0 0
## Ode_to_a_Nightingale 0 0 0 0
## Ode_to_the_West_Wind 1 1 1 1
## Still_I_Rise 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0
## Song_of_Myself 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0
## Terms
## Docs awful aye azure aëry bacchus backyard
## Kubla_Khan 0 0 0 0 0 0
## Ode_to_a_Nightingale 0 0 0 0 1 0
## Ode_to_the_West_Wind 0 0 2 1 0 0
## Still_I_Rise 1 0 0 0 0 1
## Rime_of_the_Ancient_Mariner 0 1 0 0 0 0
## Song_of_Myself 0 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs baiae's bassoon bay beaded beaker bear
## Kubla_Khan 0 0 0 0 0 0
## Ode_to_a_Nightingale 0 0 0 1 1 0
## Ode_to_the_West_Wind 1 0 1 0 0 1
## Still_I_Rise 0 0 0 0 0 1
## Rime_of_the_Ancient_Mariner 0 1 0 0 0 0
## Song_of_Myself 0 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs beard bearing beast beasts beat beauty bed
## Kubla_Khan 0 1 0 0 0 0 0
## Ode_to_a_Nightingale 0 0 0 0 0 1 0
## Ode_to_the_West_Wind 0 0 0 0 0 0 1
## Still_I_Rise 0 0 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 2 0 0 1 2 0 0
## Song_of_Myself 0 0 1 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0 0
## Terms
## Docs beechen bell bends beneath beset bethlehem
## Kubla_Khan 0 0 0 2 0 0
## Ode_to_a_Nightingale 1 1 0 0 0 0
## Ode_to_the_West_Wind 0 0 0 1 0 0
## Still_I_Rise 0 0 0 0 1 0
## Rime_of_the_Ancient_Mariner 0 0 1 0 0 0
## Song_of_Myself 0 0 0 0 0 1
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs beware bird birds birth bitter black blank
## Kubla_Khan 2 0 0 0 0 0 0
## Ode_to_a_Nightingale 0 1 0 0 0 0 0
## Ode_to_the_West_Wind 0 0 0 1 0 2 0
## Still_I_Rise 0 0 0 0 1 1 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 0 0 0
## Song_of_Myself 0 0 1 0 0 0 1
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0 0
## Terms
## Docs blast bleed blood blooms blossomed blow
## Kubla_Khan 0 0 0 0 1 0
## Ode_to_a_Nightingale 0 0 0 0 0 0
## Ode_to_the_West_Wind 0 1 0 1 0 1
## Still_I_Rise 0 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 2 0 0 0 0 1
## Song_of_Myself 0 0 1 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs blown blue blushful body born boughs bow
## Kubla_Khan 0 0 0 0 0 0 0
## Ode_to_a_Nightingale 1 0 1 0 1 1 0
## Ode_to_the_West_Wind 0 2 0 0 0 1 0
## Still_I_Rise 0 0 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 0 0 1
## Song_of_Myself 0 0 0 1 1 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0 0
## Terms
## Docs bow'd bowed boyhood brain breast breath
## Kubla_Khan 0 0 0 0 0 0
## Ode_to_a_Nightingale 0 0 0 1 0 1
## Ode_to_the_West_Wind 1 0 1 0 0 1
## Still_I_Rise 0 1 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 2 0
## Song_of_Myself 0 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs breathing breezes bride bridegroom's
## Kubla_Khan 1 0 0 0
## Ode_to_a_Nightingale 0 1 0 0
## Ode_to_the_West_Wind 0 0 0 0
## Still_I_Rise 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 1 1
## Song_of_Myself 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0
## Terms
## Docs bright brim bringing broken bubbles buds
## Kubla_Khan 1 0 0 0 0 0
## Ode_to_a_Nightingale 0 1 0 0 1 0
## Ode_to_the_West_Wind 1 0 0 0 0 1
## Still_I_Rise 0 0 1 1 0 0
## Rime_of_the_Ancient_Mariner 3 0 0 0 0 0
## Song_of_Myself 0 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0 0
## Terms
## Docs build buried burst call’d carriage
## Kubla_Khan 1 0 1 0 0
## Ode_to_a_Nightingale 0 1 0 1 0
## Ode_to_the_West_Wind 0 0 1 0 0
## Still_I_Rise 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 0
## Song_of_Myself 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 1
## Terms
## Docs casements caverns caves cease ceaseless
## Kubla_Khan 0 2 3 0 1
## Ode_to_a_Nightingale 1 0 0 1 0
## Ode_to_the_West_Wind 0 0 0 0 0
## Still_I_Rise 0 0 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0 0 0
## Song_of_Myself 0 0 0 0 0
## Because_I_could_not_stop_for_Death 0 0 0 0 0
## Terms
## Docs cedarn centre centuries
## Kubla_Khan 1 0 0
## Ode_to_a_Nightingale 0 0 0
## Ode_to_the_West_Wind 0 0 0
## Still_I_Rise 0 0 0
## Rime_of_the_Ancient_Mariner 0 0 0
## Song_of_Myself 0 1 1
## Because_I_could_not_stop_for_Death 0 0 1
It can be seen that the most frequent words are thou, thy, hear, etc.
It can also be seen that the poems with the largest numbers of words are Ode to a Nightingale, Ode to the West Wind, Rime of the Ancient Mariner, etc.
In the next step, we will graphically represent the words that the analyzed poems have in common and compare them with the poem written by Percy Bysshe Shelley. To do that, we must first find the words that Shelley's Ode to the West Wind shares with the other poems:
We will use the poem written by Shelley as the reference against which the other poems are compared. Words that lie close to the line have similar frequencies in both poems. As can be seen from the table above, the common words that Shelley uses in Ode to the West Wind also appear in the poems of Emily Dickinson, John Keats, etc., but in different proportions. The first two common words, day and quivering, appear in both Shelley's and Dickinson's poems and have the greatest proportion (1.8896%). The common words used by Shelley that also appear in Keats's poem include air, art, boughs, etc., with a proportion of 0.3367%. To get a clearer picture of these common words, we will plot the words shared by Shelley's poem and the poems written by the other authors.
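The comparison described above can be sketched as follows: compute each word's share of its author's total, then join every other author's words against Shelley's. The counts are made-up toy values, and the object and column names are assumptions.

```r
library(dplyr)

# Toy word counts per author (made-up numbers for illustration).
counts <- tibble(
  author = c(rep("Bysshe_Shelley", 3), rep("John_Keats", 3)),
  word   = c("air", "thou", "leaves", "air", "thou", "wine"),
  n      = c(1, 4, 5, 1, 2, 2)
)

# Relative frequency of each word within its author's text.
freq <- counts %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Shelley's proportions serve as the reference column.
shelley <- freq %>%
  filter(author == "Bysshe_Shelley") %>%
  select(word, shelley = proportion)

# Keep only the words the other authors share with Shelley.
common <- freq %>%
  filter(author != "Bysshe_Shelley") %>%
  inner_join(shelley, by = "word")

common
```

Plotting `proportion` against `shelley` for each author then places words with similar relative frequencies near the line y = x.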
## The maximum relative frequency of used words by Bysshe_Shelley is: 0.03956835
## The minimum relative frequency of used words by Bysshe_Shelley is: 0.003597122
As can be seen, the authors whose shared words have frequencies closest to those in Shelley's Ode to the West Wind are John Keats and Samuel Taylor Coleridge.
## The minimum relative frequency of common words is: 0.00243309
## The maximum relative frequency of common words is: 0.01886792
The correlation test results support what we observed above. The author whose word usage is closest to Shelley's is John Keats. The correlation coefficient ranges from -1 to 1 and indicates the strength and direction of the relationship between two variables (here, the frequency of Shelley's words and the proportion, that is, the frequency of the same words in the other analyzed authors). The strongest correlation is between Shelley and John Keats (0.7921, indicating a direct, positive relationship), and it is statistically significant (the p-value is less than 0.05, so we reject the null hypothesis at the 5% significance level). Shelley and Keats have a similar vocabulary and tend to use similar words with similar frequency and proportion. On the other hand, the vocabulary used by Samuel Taylor Coleridge differs from Shelley's: these two authors either use different words or use the same words with very different frequencies.
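A test of this kind can be run with cor.test() from base R, which computes a Pearson correlation by default. The two vectors below are made-up relative frequencies for five shared words, so the result does not reproduce the 0.7921 reported above; the variable names are assumptions.

```r
# Made-up relative frequencies of five shared words (illustration only).
shelley <- c(0.0036, 0.0072, 0.0108, 0.0144, 0.0396)
other   <- c(0.0034, 0.0067, 0.0135, 0.0101, 0.0337)

# Pearson correlation test between the two frequency vectors.
res <- cor.test(shelley, other)

# Correlation coefficient and p-value of the test.
res$estimate
res$p.value
```

A coefficient close to 1 together with a p-value below 0.05 would be read as in the paragraph above: the two authors use their shared words with similar relative frequencies.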