Exploratory Data Analysis of a Dataset Consisting of Seven Poems using various R libraries

Analysis of a Dataset Consisting of Seven Poems Using NLP Tools

Objective: In this study, we apply Natural Language Processing (NLP) methods to text analytics and visualization. Text data is usually available in an unstructured format, so preprocessing is required before insights can be derived from it. We will use NLP tools to read the text and extract meaningful information. Specifically, this study analyzes seven poems and generates graphical visualizations, including word distributions and word clouds, using various R libraries.

In the working session, we will load the packages we need:
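The original package list is not shown. A minimal sketch of the loading step follows: tidytext and tm are named later in the text, while dplyr, ggplot2, and wordcloud are assumptions.

```r
# Packages assumed for this analysis (only tidytext and tm are named in the text)
library(dplyr)      # data manipulation
library(tidytext)   # tokenization and stop word list
library(ggplot2)    # plots
library(tm)         # term-document matrices
library(wordcloud)  # word clouds
```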

Database creation

We will create our database by adding, one by one, seven poems written in English by different authors. The poems are available on the Poetry Foundation website. The resulting database will then be analyzed.

The full poem text is available here (Still I Rise By Maya Angelou)

The first few lines of the data are presented:

## [1] "You may write me down in history" "With your bitter, twisted lies," 
## [3] "You may trod me in the very dirt" "But still, like dust, I'll rise."
## [5] "Does my sassiness upset you?"     "Why are you beset with gloom?"

The full poem text is available here (Kubla Khan By Samuel Taylor Coleridge)

The first few lines of the data are presented:

## [1] "In Xanadu did Kubla Khan"             
## [2] "A stately pleasure-dome decree:"      
## [3] "Where Alph, the sacred river, ran"    
## [4] "Through caverns measureless to man"   
## [5] "Down to a sunless sea."               
## [6] "So twice five miles of fertile ground"

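The full poem text is available here (Song of Myself (1892 version) By Walt Whitman). Note that the lines printed below are the opening of The Second Coming by W. B. Yeats, not of Song of Myself; the dataset nevertheless labels this text Song_of_Myself.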

The first few lines of the data are presented:

## [1] "Turning and turning in the widening gyre"       
## [2] "The falcon cannot hear the falconer;"           
## [3] "Things fall apart; the centre cannot hold;"     
## [4] "Mere anarchy is loosed upon the world,"         
## [5] "The blood-dimmed tide is loosed, and everywhere"
## [6] "The ceremony of innocence is drowned;"

The full poem text is available here (Because I could not stop for Death – (479) By Emily Dickinson)

The first few lines of the data are presented:

## [1] "Because I could not stop for Death –"  
## [2] "He kindly stopped for me –"            
## [3] "The Carriage held but just Ourselves –"
## [4] "And Immortality."                      
## [5] "We slowly drove – He knew no haste"    
## [6] "And I had put away"

The full poem text is available here (Ode to a Nightingale by John Keats)

The first few lines of the data are presented:

## [1] "My heart aches, and a drowsy numbness pains"
## [2] "My sense, as though of hemlock I had drunk,"
## [3] "Or emptied some dull opiate to the drains"  
## [4] "One minute past, and Lethe-wards had sunk:" 
## [5] "'Tis not through envy of thy happy lot,"    
## [6] "But being too happy in thine happiness, —"

The full poem text is available here (The Rime of the Ancient Mariner by Samuel Taylor Coleridge)

The first few lines of the data are presented:

## [1] " It is an ancient Mariner,"                
## [2] "And he stoppeth one of three."             
## [3] "By thy long grey beard and glittering eye,"
## [4] "Now wherefore stopp'st thou me?"           
## [5] "The Bridegroom's doors are opened wide,"   
## [6] "And I am next of kin;"

The full poem text is available here (Ode to the West Wind by Percy Bysshe Shelley)

The first few lines of the data are presented:

## [1] "O wild West Wind, thou breath of Autumn's being,"  
## [2] "Thou, from whose unseen presence the leaves dead"  
## [3] "Are driven, like ghosts from an enchanter fleeing,"
## [4] "Yellow, and black, and pale, and hectic red,"      
## [5] "Pestilence-stricken multitudes: O thou,"           
## [6] "Who chariotest to their dark wintry bed"

We will add a column with the name of the poem
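A minimal sketch of this step, assuming each poem is held as a character vector of verses. The object names `still_i_rise` and `still_df` are hypothetical; the verses are copied from the output shown earlier.

```r
# First verses of "Still I Rise", copied from the printed output
still_i_rise <- c("You may write me down in history",
                  "With your bitter, twisted lies,",
                  "You may trod me in the very dirt",
                  "But still, like dust, I'll rise.")

# One row per verse: verse number, verse text, and the poem title
still_df <- data.frame(
  line  = seq_along(still_i_rise),
  text  = still_i_rise,
  title = "Still I Rise",
  stringsAsFactors = FALSE
)
head(still_df)
```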

Dataset description

The next step is to combine all created tables into one dataframe using the rbind() function.

The resulting dataset (poezii) consists of seven English poems, in which each row provides:

  1. the number of the verse (line) within the poem,

  2. the text of the verse,

  3. the title of the poem to which the verse belongs.
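The combination step can be sketched as follows for two of the seven tables. The per-poem object names are hypothetical, and only the first two verses of each poem are used to keep the example short.

```r
# Two small per-poem tables with identical columns
kubla_df <- data.frame(
  line  = 1:2,
  text  = c("In Xanadu did Kubla Khan", "A stately pleasure-dome decree:"),
  title = "Kubla Khan",
  stringsAsFactors = FALSE
)
keats_df <- data.frame(
  line  = 1:2,
  text  = c("My heart aches, and a drowsy numbness pains",
            "My sense, as though of hemlock I had drunk,"),
  title = "Ode to a Nightingale",
  stringsAsFactors = FALSE
)

# rbind() stacks the tables row-wise; the column names must match
poezii <- rbind(kubla_df, keats_df)
str(poezii)
unique(poezii$title)
```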

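The seven poem titles from our created dataset are: Still I Rise, Kubla Khan, Song of Myself, Because I could not stop for Death, Ode to a Nightingale, Rime of the Ancient Mariner, and Ode to the West Wind.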

We will transform our poem dataset into a structured format that is easier to analyze and parse at the level of individual words. The unnest_tokens() function from the tidytext package breaks the verses into individual words.
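A minimal sketch of the tokenization on two verses of Kubla Khan. The table name `poems` is hypothetical. By default, unnest_tokens() lowercases the text, strips punctuation, and drops the input column.

```r
library(dplyr)
library(tidytext)

poems <- tibble(
  line  = 1:2,
  text  = c("In Xanadu did Kubla Khan", "Down to a sunless sea."),
  title = "Kubla Khan"
)

# One row per word; the new column is called `word`
words <- poems %>%
  unnest_tokens(word, text)
words
```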

All histograms show right-skewed distributions, consistent with Zipf’s law. This means that a small number of frequent words (such as stopwords or common words) dominate, while many words appear only rarely. These less frequent words are often part of the author’s unique vocabulary and include thematic words related to nature, emotions, symbolic imagery, and similar concepts. It is important to note that stopwords were not removed from the analyzed dataset.

As we can see, the most frequently used word is "the", followed by "and", "of", and so on.
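Word frequencies of this kind can be obtained with dplyr's count(). The sketch below uses a small hypothetical word list rather than the full dataset.

```r
library(dplyr)

# Toy token table (hypothetical); in the analysis this would be the tokenized poems
words <- tibble(word = c("the", "sacred", "river", "and", "the",
                         "caverns", "of", "the", "sea", "and"))

# Count occurrences of each word, most frequent first
word_counts <- words %>%
  count(word, sort = TRUE)
word_counts
```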

In the next step, the stopwords will be removed from the analyzed dataset, retaining only the relevant words.
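One common way to do this, assuming the stop word list shipped with tidytext (`stop_words`) is the one used, is an anti-join on the word column. The token table below is a toy example.

```r
library(dplyr)
library(tidytext)

# Toy token table (hypothetical)
words <- tibble(word = c("the", "albatross", "and", "the",
                         "caverns", "of", "xanadu"))

# Keep only the rows whose word is NOT in the stop word list
content_words <- words %>%
  anti_join(stop_words, by = "word")
content_words
```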

It can be observed that all histograms display right-skewed distributions, consistent with Zipf’s law. This means that high-frequency (common) words dominate and appear at the beginning of the distribution, while low-frequency words occur rarely and appear in the tail of the distribution.

Distribution of the words

Plot of the distribution of words that occur more than 4 times.
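A plot of this kind could be sketched with ggplot2 as below. The counts are hypothetical values chosen for illustration, not results from the dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical word counts, for illustration only
word_counts <- tibble(
  word = c("thou", "thy", "hear", "sea", "wild"),
  n    = c(11, 9, 6, 4, 3)
)

# Keep words occurring more than 4 times and draw a horizontal bar chart
frequent <- word_counts %>% filter(n > 4)

p <- ggplot(frequent, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Occurrences", y = NULL)
p
```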

Distribution of the words in each poem

The poem with the largest number of words is Ode to a Nightingale (297 words), followed by Ode to the West Wind (278 words), Rime of the Ancient Mariner (226 words), and so on, as the table below shows. At the opposite end is Because I could not stop for Death, with 53 words.

For comparison, Kubla Khan contains 185 words, fewer than each of the three poems listed above.
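The number of words per poem can be computed by counting rows of the tokenized table by title. A minimal sketch on a toy table (the real counts are the ones reported above):

```r
library(dplyr)

# Toy tokenized table (hypothetical): one row per word
words <- tibble(
  title = c(rep("Kubla Khan", 3), rep("Still I Rise", 2)),
  word  = c("xanadu", "caverns", "caves", "dust", "rise")
)

# Number of words per poem, largest first
words_per_poem <- words %>%
  count(title, name = "n_words", sort = TRUE)
words_per_poem
```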

The figure below presents the distribution of words with an occurrence frequency greater than 2 in six of the poems: Because I could not stop for Death, Kubla Khan, Ode to a Nightingale, Still I Rise, etc.

Each graphic presents the most used words (keywords) in each poem and their occurrence frequency.
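A per-poem plot of this kind could be produced by counting words within each title and faceting by title. The sketch below assumes ggplot2 and uses a toy table whose counts match the matrix excerpt printed later in the document (caves 3, caverns 2, albatross 3, beard 2).

```r
library(dplyr)
library(ggplot2)

words <- tibble(
  title = c(rep("Kubla Khan", 5), rep("Rime of the Ancient Mariner", 5)),
  word  = c(rep("caves", 3), rep("caverns", 2),
            rep("albatross", 3), rep("beard", 2))
)

# Count words within each poem and keep those occurring more than twice
per_poem <- words %>%
  count(title, word, sort = TRUE) %>%
  filter(n > 2)

# One panel per poem
p <- ggplot(per_poem, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  facet_wrap(~ title, scales = "free_y") +
  labs(x = "Occurrences", y = NULL)
p
```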

Words distribution and the corresponding poem

Graphic representation of wordcloud

To create a word cloud, we first compute the term-document matrix (TDM), in which each poem title represents a document. For this purpose, we can use the tm or tidytext packages in R. The matrix is printed with documents as rows and terms as columns.

##                                     Terms
## Docs                                 abora abroad abyssinian aches adieu age
##   Kubla_Khan                             1      0          1     0     0   0
##   Ode_to_a_Nightingale                   0      1          0     1     3   1
##   Ode_to_the_West_Wind                   0      0          0     0     0   0
##   Still_I_Rise                           0      0          0     0     0   0
##   Rime_of_the_Ancient_Mariner            0      0          0     0     0   0
##   Song_of_Myself                         0      0          0     0     0   0
##   Because_I_could_not_stop_for_Death     0      0          0     0     0   0
##                                     Terms
## Docs                                 air albatross alien alph amid anarchy
##   Kubla_Khan                           1         0     0    1    1       0
##   Ode_to_a_Nightingale                 1         0     1    0    1       0
##   Ode_to_the_West_Wind                 1         0     0    0    0       0
##   Still_I_Rise                         1         0     0    0    0       0
##   Rime_of_the_Ancient_Mariner          0         3     0    0    0       0
##   Song_of_Myself                       0         0     0    0    0       1
##   Because_I_could_not_stop_for_Death   0         0     0    0    0       0
##                                     Terms
## Docs                                 ancestors ancestral ancient angels anthem
##   Kubla_Khan                                 0         1       1      0      0
##   Ode_to_a_Nightingale                       0         0       1      0      1
##   Ode_to_the_West_Wind                       0         0       0      1      0
##   Still_I_Rise                               1         0       0      0      0
##   Rime_of_the_Ancient_Mariner                0         0       4      0      0
##   Song_of_Myself                             0         0       0      0      0
##   Because_I_could_not_stop_for_Death         0         0       0      0      0
##                                     Terms
## Docs                                 approaching art ashes ate athwart
##   Kubla_Khan                                   0   0     0   0       1
##   Ode_to_a_Nightingale                         0   1     0   0       0
##   Ode_to_the_West_Wind                         1   1     1   0       0
##   Still_I_Rise                                 0   0     0   0       0
##   Rime_of_the_Ancient_Mariner                  0   0     0   1       0
##   Song_of_Myself                               0   0     0   0       0
##   Because_I_could_not_stop_for_Death           0   0     0   0       0
##                                     Terms
## Docs                                 atlantic's atmosphere autumn's autumnal
##   Kubla_Khan                                  0          0        0        0
##   Ode_to_a_Nightingale                        0          0        0        0
##   Ode_to_the_West_Wind                        1          1        1        1
##   Still_I_Rise                                0          0        0        0
##   Rime_of_the_Ancient_Mariner                 0          0        0        0
##   Song_of_Myself                              0          0        0        0
##   Because_I_could_not_stop_for_Death          0          0        0        0
##                                     Terms
## Docs                                 awful aye azure aëry bacchus backyard
##   Kubla_Khan                             0   0     0    0       0        0
##   Ode_to_a_Nightingale                   0   0     0    0       1        0
##   Ode_to_the_West_Wind                   0   0     2    1       0        0
##   Still_I_Rise                           1   0     0    0       0        1
##   Rime_of_the_Ancient_Mariner            0   1     0    0       0        0
##   Song_of_Myself                         0   0     0    0       0        0
##   Because_I_could_not_stop_for_Death     0   0     0    0       0        0
##                                     Terms
## Docs                                 baiae's bassoon bay beaded beaker bear
##   Kubla_Khan                               0       0   0      0      0    0
##   Ode_to_a_Nightingale                     0       0   0      1      1    0
##   Ode_to_the_West_Wind                     1       0   1      0      0    1
##   Still_I_Rise                             0       0   0      0      0    1
##   Rime_of_the_Ancient_Mariner              0       1   0      0      0    0
##   Song_of_Myself                           0       0   0      0      0    0
##   Because_I_could_not_stop_for_Death       0       0   0      0      0    0
##                                     Terms
## Docs                                 beard bearing beast beasts beat beauty bed
##   Kubla_Khan                             0       1     0      0    0      0   0
##   Ode_to_a_Nightingale                   0       0     0      0    0      1   0
##   Ode_to_the_West_Wind                   0       0     0      0    0      0   1
##   Still_I_Rise                           0       0     0      0    0      0   0
##   Rime_of_the_Ancient_Mariner            2       0     0      1    2      0   0
##   Song_of_Myself                         0       0     1      0    0      0   0
##   Because_I_could_not_stop_for_Death     0       0     0      0    0      0   0
##                                     Terms
## Docs                                 beechen bell bends beneath beset bethlehem
##   Kubla_Khan                               0    0     0       2     0         0
##   Ode_to_a_Nightingale                     1    1     0       0     0         0
##   Ode_to_the_West_Wind                     0    0     0       1     0         0
##   Still_I_Rise                             0    0     0       0     1         0
##   Rime_of_the_Ancient_Mariner              0    0     1       0     0         0
##   Song_of_Myself                           0    0     0       0     0         1
##   Because_I_could_not_stop_for_Death       0    0     0       0     0         0
##                                     Terms
## Docs                                 beware bird birds birth bitter black blank
##   Kubla_Khan                              2    0     0     0      0     0     0
##   Ode_to_a_Nightingale                    0    1     0     0      0     0     0
##   Ode_to_the_West_Wind                    0    0     0     1      0     2     0
##   Still_I_Rise                            0    0     0     0      1     1     0
##   Rime_of_the_Ancient_Mariner             0    0     0     0      0     0     0
##   Song_of_Myself                          0    0     1     0      0     0     1
##   Because_I_could_not_stop_for_Death      0    0     0     0      0     0     0
##                                     Terms
## Docs                                 blast bleed blood blooms blossomed blow
##   Kubla_Khan                             0     0     0      0         1    0
##   Ode_to_a_Nightingale                   0     0     0      0         0    0
##   Ode_to_the_West_Wind                   0     1     0      1         0    1
##   Still_I_Rise                           0     0     0      0         0    0
##   Rime_of_the_Ancient_Mariner            2     0     0      0         0    1
##   Song_of_Myself                         0     0     1      0         0    0
##   Because_I_could_not_stop_for_Death     0     0     0      0         0    0
##                                     Terms
## Docs                                 blown blue blushful body born boughs bow
##   Kubla_Khan                             0    0        0    0    0      0   0
##   Ode_to_a_Nightingale                   1    0        1    0    1      1   0
##   Ode_to_the_West_Wind                   0    2        0    0    0      1   0
##   Still_I_Rise                           0    0        0    0    0      0   0
##   Rime_of_the_Ancient_Mariner            0    0        0    0    0      0   1
##   Song_of_Myself                         0    0        0    1    1      0   0
##   Because_I_could_not_stop_for_Death     0    0        0    0    0      0   0
##                                     Terms
## Docs                                 bow'd bowed boyhood brain breast breath
##   Kubla_Khan                             0     0       0     0      0      0
##   Ode_to_a_Nightingale                   0     0       0     1      0      1
##   Ode_to_the_West_Wind                   1     0       1     0      0      1
##   Still_I_Rise                           0     1       0     0      0      0
##   Rime_of_the_Ancient_Mariner            0     0       0     0      2      0
##   Song_of_Myself                         0     0       0     0      0      0
##   Because_I_could_not_stop_for_Death     0     0       0     0      0      0
##                                     Terms
## Docs                                 breathing breezes bride bridegroom's
##   Kubla_Khan                                 1       0     0            0
##   Ode_to_a_Nightingale                       0       1     0            0
##   Ode_to_the_West_Wind                       0       0     0            0
##   Still_I_Rise                               0       0     0            0
##   Rime_of_the_Ancient_Mariner                0       0     1            1
##   Song_of_Myself                             0       0     0            0
##   Because_I_could_not_stop_for_Death         0       0     0            0
##                                     Terms
## Docs                                 bright brim bringing broken bubbles buds
##   Kubla_Khan                              1    0        0      0       0    0
##   Ode_to_a_Nightingale                    0    1        0      0       1    0
##   Ode_to_the_West_Wind                    1    0        0      0       0    1
##   Still_I_Rise                            0    0        1      1       0    0
##   Rime_of_the_Ancient_Mariner             3    0        0      0       0    0
##   Song_of_Myself                          0    0        0      0       0    0
##   Because_I_could_not_stop_for_Death      0    0        0      0       0    0
##                                     Terms
## Docs                                 build buried burst call’d carriage
##   Kubla_Khan                             1      0     1      0        0
##   Ode_to_a_Nightingale                   0      1     0      1        0
##   Ode_to_the_West_Wind                   0      0     1      0        0
##   Still_I_Rise                           0      0     0      0        0
##   Rime_of_the_Ancient_Mariner            0      0     0      0        0
##   Song_of_Myself                         0      0     0      0        0
##   Because_I_could_not_stop_for_Death     0      0     0      0        1
##                                     Terms
## Docs                                 casements caverns caves cease ceaseless
##   Kubla_Khan                                 0       2     3     0         1
##   Ode_to_a_Nightingale                       1       0     0     1         0
##   Ode_to_the_West_Wind                       0       0     0     0         0
##   Still_I_Rise                               0       0     0     0         0
##   Rime_of_the_Ancient_Mariner                0       0     0     0         0
##   Song_of_Myself                             0       0     0     0         0
##   Because_I_could_not_stop_for_Death         0       0     0     0         0
##                                     Terms
## Docs                                 cedarn centre centuries
##   Kubla_Khan                              1      0         0
##   Ode_to_a_Nightingale                    0      0         0
##   Ode_to_the_West_Wind                    0      0         0
##   Still_I_Rise                            0      0         0
##   Rime_of_the_Ancient_Mariner             0      0         0
##   Song_of_Myself                          0      1         1
##   Because_I_could_not_stop_for_Death      0      0         1
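A matrix with this layout (documents as rows, terms as columns) can be built from a table of word counts with cast_dtm() from tidytext, which relies on the tm package. This is a sketch of one possible route, not necessarily the one used for the output above; the three counts are taken from the printed excerpt.

```r
library(tidytext)
library(tm)

# Word counts per poem, taken from the printed matrix excerpt
counts <- data.frame(
  title = c("Kubla_Khan", "Kubla_Khan", "Rime_of_the_Ancient_Mariner"),
  word  = c("caves", "caverns", "albatross"),
  n     = c(3, 2, 3),
  stringsAsFactors = FALSE
)

# Cast to a document-term matrix, then to a dense matrix for printing
dtm <- cast_dtm(counts, document = title, term = word, value = n)
m <- as.matrix(dtm)
m
```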

It can be seen that the most frequent words are thou, thy, hear, etc.
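The word cloud itself can be drawn from the column totals of the matrix. The sketch below assumes the wordcloud package and uses only four columns from the printed excerpt, so its most frequent word differs from that of the full dataset.

```r
library(wordcloud)

# Four term columns copied from the printed matrix excerpt
m <- rbind(
  Kubla_Khan                  = c(ancient = 1, bright = 1, air = 1, albatross = 0),
  Ode_to_a_Nightingale        = c(1, 0, 1, 0),
  Ode_to_the_West_Wind        = c(0, 1, 1, 0),
  Still_I_Rise                = c(0, 0, 1, 0),
  Rime_of_the_Ancient_Mariner = c(4, 3, 0, 3)
)

# Total frequency of each term across all documents, most frequent first
freq <- sort(colSums(m), decreasing = TRUE)

set.seed(1)  # word placement is random
wordcloud(words = names(freq), freq = freq, min.freq = 1, random.order = FALSE)
```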

It can also be seen that the poems with the largest number of words are Ode to a Nightingale, Ode to the West Wind, Rime of the Ancient Mariner, etc.

Comparative analysis of common word frequency in the Ode to the West Wind poem written by Percy Bysshe Shelley and the other poems

In the next step, we will plot the words common to the analyzed poems and compare them with the poem written by Percy Bysshe Shelley. To do that, we must first find the words shared by Ode to the West Wind, written by Shelley, and the other poems:

We will use the text of the poem written by Percy Bysshe Shelley as the reference against which the other poems are compared. Words that lie close to the line have similar frequencies in both poems. As can be seen from the table above, the common words that Shelley uses in Ode to the West Wind also appear in the poems of Emily Dickinson, John Keats, etc., but in different proportions. The first two common words, day and quivering, appear in the poems of both Shelley and Emily Dickinson and have the greatest proportion (1.8868%). The common words used by Shelley that also appear in the poetry of John Keats are air, art, boughs, etc., with a proportion of 0.3367%. To get a clear picture of these common words, we will plot the common words from Shelley's poem against the poems written by the other authors.
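The comparison described here can be sketched as follows: compute each word's proportion within each author's poem, then join the reference author's proportions to those of the other authors on the word column. The token table is a toy example, and the column names `author`, `proportion`, and `shelley` are assumptions.

```r
library(dplyr)

# Toy tokenized table (hypothetical): one row per word
words <- tibble(
  author = c(rep("Bysshe_Shelley", 4), rep("John_Keats", 4)),
  word   = c("air", "art", "wind", "leaves",
             "air", "art", "heart", "wine")
)

# Relative frequency of each word within each author's text
proportions <- words %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

# Reference proportions for Shelley
shelley <- proportions %>%
  filter(author == "Bysshe_Shelley") %>%
  select(word, shelley = proportion)

# Words of the other authors that Shelley also uses
common <- proportions %>%
  filter(author != "Bysshe_Shelley") %>%
  inner_join(shelley, by = "word")
common
```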

Correlation analysis

## The maximum relative frequency of used words by Bysshe_Shelley is:  0.03956835
## The minimum relative frequency of used words by Bysshe_Shelley is:  0.003597122

As can be seen, the authors who use the same words as Bysshe Shelley, with frequencies close to those in Ode to the West Wind, are John Keats and Samuel Taylor Coleridge.

## The minimum relative frequency of common words is:  0.00243309
## The maximum relative frequency of common words is:  0.01886792

The correlation test results support what we observed above. The author whose word use is closest to that of Bysshe Shelley is John Keats. The correlation coefficient ranges from -1 to 1 and indicates the strength and direction of the relationship between two variables (here, the frequency of the words used by Shelley and proportion, the frequency of the words used by the other analyzed authors). The strongest correlation is between Shelley and John Keats (0.7921, a direct or positive relationship), and it is statistically significant (the p-value is less than 0.05, so we reject the null hypothesis of no correlation at the 5% significance level). Shelley and Keats have a similar vocabulary and tend to use similar words with similar frequency. On the other hand, the vocabulary used by Samuel Taylor Coleridge differs from the one used by Shelley: these two authors either use different words or use the same words with very different frequencies.
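A test of this kind can be run with cor.test() from base R, which computes Pearson's correlation by default and tests the null hypothesis that the true correlation is zero. The two vectors below are hypothetical relative frequencies of five common words, not values from the dataset.

```r
# Hypothetical relative frequencies of five words shared by two authors
shelley <- c(0.004, 0.007, 0.011, 0.014, 0.018)
keats   <- c(0.003, 0.008, 0.009, 0.015, 0.017)

# Pearson correlation test; H0: the true correlation equals 0
result <- cor.test(keats, shelley)

result$estimate  # correlation coefficient, between -1 and 1
result$p.value   # compare with the 0.05 significance level
```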
