---
title: "CLA 3206A: Standardized Definitions"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: flatly
    source_code: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
```




Definitions: General 
======================================================================

Column {.tabset .tabset-fade}
-------------------------------------

### All Please Read!

#### How to use


For our final projects, there is no reason for each person in the class to come up with their own definitions of Zipf's Law, TF-IDF, etc. Our *CLA 3206A: Standardized Definitions* guide is designed so that you can cut and paste in what you need. If your layout is Rows, and not Columns, no problem.  

From the source RMD file, you need the text information inside the **div** tag, not the page layout information. So copy from just before the opening **div** tag to just after the closing **div** tag, and paste that inside your Flexdashboard.  

The **div** tag itself will not affect your layout or design in any way. The **div** tag container is neutral: it just marks content boundaries and has no built-in style or format effects.




#### Your Author Info
You may include the information below in a separate tab, or you may just place it under your Project information section.  But make sure to include the block below, revised as appropriate: 


This project, your title here underlined, was submitted on 5 January 2021 by [YOUR CHINESE NAME], [ID: ### last three numbers only], in partial fulfillment of the requirements for CLA 3206A: Text Mining for Liberal Arts Majors, Shantou University, Fall Semester 2020.


#### Editing Definitions for Clarity
You might not need both the general and specific explanations; or, perhaps you might want to combine the two with some editing.  Please do so, but cite accordingly.  Always include the following (which is already part of the definition **div**):

**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.

### Zipf's Law


Zipf's law is an empirical generalization based on studies of natural language text corpora.  It has repeatedly been shown to have merit and good predictive power even when applied to very different collections of documents (text corpora) across a wide range of natural languages, including extinct languages. It examines the relationship between term frequency (how often a term is used in the text corpus) and term rank. 

By definition, the most frequently used term has rank 1 (the lowest rank number); the second most frequent term, rank 2; and the least frequent term, the highest rank number.  Beyond this simple ordering pattern, Zipf’s law states that the “frequency of any word is inversely proportional to its rank in the frequency table” [{1}](https://en.wikipedia.org/wiki/Zipf%27s_law).  In other words, as the *Wikipedia* entry [{1}](https://en.wikipedia.org/wiki/Zipf%27s_law) explains, for a given collection of documents (a text corpus), Zipf’s law predicts that “the most frequent word [rank 1] will occur approximately twice as often as the second most frequent word [rank 2], three times as often as the third most frequent word [rank 3], etc.”  

Since frequency is inversely proportional to rank, the relationship follows a power law. We can therefore test for conformity to Zipf's law by plotting term frequency (Y axis) against term rank (X axis) on logarithmic scales (a log-log plot): a corpus that conforms shows a roughly straight line with a slope near -1.
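
As a minimal sketch of that log-log check, the base R snippet below fits a straight line to log frequency against log rank. The `term_counts` values are hypothetical, invented so that they follow Zipf's law exactly; with real novels the slope would only be approximately -1.

```r
# Hypothetical term counts, invented for illustration (not course data).
term_counts <- c(the = 1200, of = 600, and = 400, to = 300, a = 240, that = 200)

# Rank terms from most to least frequent: rank 1 is the most frequent term.
freq <- sort(term_counts, decreasing = TRUE)
term_rank <- seq_along(freq)

# On log-log axes a Zipf-like corpus is close to a straight line,
# so fit a linear model to the logged values and read off the slope.
fit <- lm(log10(freq) ~ log10(term_rank))
zipf_slope <- unname(coef(fit)[2])
print(round(zipf_slope, 3))

# To see the line itself (optional):
# plot(term_rank, freq, log = "xy", type = "b")
```

A slope far from -1, or a clearly curved line, would suggest the corpus departs from the Zipf pattern.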


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.


### TF-IDF


We have a text corpus composed of documents. (In *CLA 3206A*, typically a collection of novels. So the corpus is the collection; the documents, the individual novels). TF-IDF, *term frequency–inverse document frequency*, measures “how important a word is to a document in a collection” [{2}](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). 

We know that the frequently used terms in English, words such as “the”, “to”, “and”, and “of”, provide us with little insight about the document’s topics or distinct content. So *low information value*.  But we also sense that if a term occurs often in one document, but not nearly as much in the other documents, that word likely both relates to the content and helps distinguish the document. So *higher information value*. TF-IDF, a widely used statistical measure for text mining and information retrieval, provides a formal mathematical expression of that intuition. It balances the term's frequency in the document, the TF, against how many documents in the corpus contain the term, the IDF.  If a term is used often in the document and throughout the corpus, it has a low to (effectively) zero TF-IDF score. If the term appears often in the document, but rarely in the rest of the corpus: high TF-IDF score. So by this method, TF-IDF indicates the *information value of a term*. 
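
To make the balance concrete, here is a minimal base R sketch on a toy corpus. The three "novels" and their words are hypothetical, and the formulas are an assumption: TF is taken as the term's share of the words in a document, and IDF as the natural log of (number of documents / number of documents containing the term). This is one common formulation; other tools may weight the terms differently.

```r
# Toy corpus: three hypothetical "novels", already split into lower-case words.
docs <- list(
  novel_a = c("the", "whale", "the", "sea", "whale"),
  novel_b = c("the", "letter", "the", "child"),
  novel_c = c("the", "sea", "the", "island")
)

n_docs <- length(docs)
vocab <- unique(unlist(docs))

# Document frequency: in how many documents does each term appear?
doc_freq <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))

# IDF (assumed form): natural log of documents / documents containing the term.
idf <- log(n_docs / doc_freq)

# TF (share of the document's words) times IDF, one row per document-term pair.
scores <- do.call(rbind, lapply(names(docs), function(doc_name) {
  counts <- table(docs[[doc_name]])
  data.frame(
    document = doc_name,
    term = names(counts),
    tf = as.numeric(counts) / length(docs[[doc_name]]),
    idf = unname(idf[names(counts)]),
    stringsAsFactors = FALSE
  )
}))
scores$tf_idf <- scores$tf * scores$idf

# Highest-scoring terms first.
print(scores[order(-scores$tf_idf), ])
```

In this toy corpus "the" appears in every document, so its IDF and hence its TF-IDF are zero, while "whale" appears often in one document only and scores highest.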

(See also [Silge & Robinson](https://www.tidytextmining.com/tfidf.html), "Analyzing word and document frequency: tf-idf").


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.


### Sentiment Analysis


To cite a standard definition: 
 
Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. [{3}](https://en.wikipedia.org/wiki/Sentiment_analysis)

In simpler terms, we are mapping the emotional vocabulary of a document or corpus. To identify that emotional vocabulary, we use SA lexicons.  For *CLA 3206A*, the lexicons [nrc](http://sentiment.nrc.ca/lexicons-for-research/) and [bing](https://emilhvitfeldt.github.io/textdata/reference/lexicon_bing.html). The analysis is relative to the lexicon, and some of the sentiment identifications for any given lexicon might be questionable. But so long as the same lexicon is applied consistently, the results are comparable and the vocabulary mappings identify evidence-based patterns of language usage. 

We use SA in *CLA 3206A* less to precisely identify affective states and more to map out lexicon term usage.  These mappings provide insight into document style and content.
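
A minimal sketch of lexicon matching in base R follows. The `toy_lexicon` table is hypothetical and stands in for a real lexicon such as bing; the words and their labels are invented for illustration only.

```r
# Hypothetical mini-lexicon standing in for a real one such as bing.
toy_lexicon <- data.frame(
  word = c("happy", "love", "sad", "fear"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# A toy document, already split into lower-case words.
tokens <- c("the", "happy", "child", "felt", "fear", "and", "love")

# Look each token up in the lexicon; words not in the lexicon become NA
# and are simply not counted.
matched <- toy_lexicon$sentiment[match(tokens, toy_lexicon$word)]
sentiment_counts <- table(factor(matched, levels = c("negative", "positive")))
print(sentiment_counts)
```

Only words that appear in the lexicon are counted, which is why the results are always relative to the lexicon chosen.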

(See also [Silge & Robinson](https://www.tidytextmining.com/sentiment.html), "Sentiment analysis with tidy data").

**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.



Column {.tabset .tabset-fade}
-------------------------------------

### Bigrams & Word Pairs


To cite a standard definition: 
 
A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on. [{4}](https://en.wikipedia.org/wiki/Bigram)

The *CLA 3206A* bigram analysis methods focused on adjacent words, also known as *word pairs* (when separated). In some cases, as is standard Tidytext TM practice, [stop-word lists were used](https://www.tidytextmining.com/ngrams.html?q=stop#ngrams) to filter out common words and produce cleaned data sets of bigrams and word pairs [{5}](https://www.tidytextmining.com/ngrams.html).
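
As a rough base R sketch of the idea (not the course's Tidytext code), the snippet below builds adjacent word pairs from a toy sentence and drops any pair that contains a stop word. The sentence and the two-word `toy_stop_words` list are hypothetical.

```r
# A toy sentence, already split into lower-case words.
words <- c("the", "strange", "child", "and", "the", "elf", "child")

# Pair every word with the word that follows it: term1, term2.
bigrams <- data.frame(
  term1 = head(words, -1),
  term2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# Hypothetical stop-word list; a real one is much longer.
toy_stop_words <- c("the", "and")

# Keep only the pairs in which neither word is a stop word.
has_stop_word <- bigrams$term1 %in% toy_stop_words | bigrams$term2 %in% toy_stop_words
clean_pairs <- bigrams[!has_stop_word, ]
print(clean_pairs)
```

Of the six raw bigrams, only "strange child" and "elf child" survive the filter.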

Bigrams and word pairs typically provide valuable insight into the major topics of a document or corpus. Word pairs also typically provide insight about the document style, and about word-associations. These in turn may reveal important linkages between content and ideas. 


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.







### Word Corrs



**Word Pairs vs. Word Corrs**: 

Word Pairs consider adjacent words: *term1*, *term2*.  This has a linear order: the way we naturally read a text. For some examples from Nathaniel Hawthorne’s *The Scarlet Letter*, designating *term2* as “child”: “elf child”, “strange child”, “naughty child”, and “poor child”.  In contrast, Word Correlations consider document sections [{6}](https://www.tidytextmining.com/ngrams.html#counting-and-correlating-pairs-of-words-with-the-widyr-package).  If *termX* appears in a given section, what other terms are likely to appear in that same section? This can be anywhere in the section: any place before or after *termX*, not just adjacent.

**Word Corr Findings**: 

So a Word Correlation analysis tells us what words are associated with *termX*, and the strength of those associations [{6}](https://www.tidytextmining.com/ngrams.html#counting-and-correlating-pairs-of-words-with-the-widyr-package).  These word clusters, centered on *termX*, reveal linkages of language and thought that might otherwise escape our attention since we typically read in a linear fashion. In the case of an individual author, they help show that author’s linguistic habits – the unconscious as well as conscious mind at work.  For a corpus of different authors, they can help reveal underlying assumptions – assumptions perhaps even unknown to the authors being studied!
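
A minimal base R sketch of a section-level correlation follows. The four "sections" are hypothetical word lists. The cited chapter uses the phi coefficient, which for 0/1 presence indicators is the same as the Pearson correlation, so base `cor()` is used here as a stand-in.

```r
# Four hypothetical document sections, each reduced to the terms of interest.
sections <- list(
  c("pearl", "child", "elf"),
  c("pearl", "child"),
  c("minister", "scaffold"),
  c("minister", "pearl", "scaffold")
)

terms <- sort(unique(unlist(sections)))

# One row per section, one column per term: 1 if the term appears, else 0.
presence <- t(sapply(sections, function(s) as.integer(terms %in% s)))
colnames(presence) <- terms

# Correlation between the presence columns of every pair of terms.
phi <- cor(presence)
print(round(phi["minister", ], 2))
```

Here "minister" and "scaffold" always appear in the same sections (correlation 1), while "minister" and "child" never share a section (correlation -1), regardless of where in the section each word falls.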



**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.


### Gender 


The Gender Analyses in CLA 3206A follow the lead of Julia [Silge's study](https://github.com/juliasilge/old_bloggy_blog/blob/master/_R/2017-04-15-Gender-Pronouns.Rmd), for which Silge herself credits the breakthrough [academic study](https://culturalanalytics.org/article/11066-understanding-gender-and-character-agency-in-the-19th-century-novel) by Professors Matthew Jockers and Gabi Kirilloff, "Understanding Gender and Character Agency in the 19th Century Novel", *Journal of Cultural Analytics* (2017). 

Approaching the topic of “character identity [as realized through] character action”, [Jockers and Kirilloff](https://culturalanalytics.org/article/11066-understanding-gender-and-character-agency-in-the-19th-century-novel) (2017) examined “character agency in the context of character gender” by examining “trends in behavior associated with male and female characters”. They did so by studying patterns of gender pronoun usage: *she* what? *he* what? Similar to [Silge](https://github.com/juliasilge/old_bloggy_blog/blob/master/_R/2017-04-15-Gender-Pronouns.Rmd), CLA 3206A follows their general method, though with differences on the technical (coding) level.

**She | He** 

By seeing what words -- and particularly verbs -- follow "she" or "he", we can gain insight into how these texts portray gender and hence gender roles:  "character agency in the context of character gender", as [Jockers and Kirilloff](https://culturalanalytics.org/article/11066-understanding-gender-and-character-agency-in-the-19th-century-novel) (2017) express it.
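
The following base R sketch shows the basic move on a toy sentence: find each "she" or "he" and record the word that comes next. The sentence is hypothetical, and this is only an illustration of the idea, not the code used by Silge or in CLA 3206A.

```r
# A toy text, already split into lower-case words.
words <- c("she", "wept", "and", "he", "laughed", "but",
           "she", "ran", "while", "he", "wept")

# Positions of the pronouns, ignoring one that ends the text (nothing follows it).
idx <- which(words %in% c("she", "he"))
idx <- idx[idx < length(words)]

# Pair each pronoun with the word that follows it.
pairs <- data.frame(
  pronoun = words[idx],
  following = words[idx + 1],
  stringsAsFactors = FALSE
)

# Cross-tabulate: which words follow "she", and which follow "he"?
follow_counts <- table(pairs$pronoun, pairs$following)
print(follow_counts)
```

With a full novel, comparing the two rows of such a table is what shows which actions cluster around each pronoun.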

**Her | His**

In contrast to the pronouns “she” and “he”, which tend to be followed by verbs, the pronouns “her”, “hers”, and “herself”, and “his”, “him”, and “himself”, tend to be followed by nouns. These noun-clusters, the pronoun group and its associations, likewise provide insight into gender conceptualization. They typically also help reveal key content and provide insight into topics.



**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.




Definitions: Specific 
======================================================================


Column {.tabset .tabset-fade}
-------------------------------------


### Plutchik SA


The *Sentiment Analysis* lexicon [nrc](http://sentiment.nrc.ca/lexicons-for-research/) categorizes terms – the identified vocabulary –  according to eight primary emotions as defined by the psychologist Robert Plutchik [{7}](https://en.wikipedia.org/wiki/Emotion_classification#Plutchik.27s_wheel_of_emotions). The eight basic emotions according to Plutchik are: *Anger*, *Anticipation*, *Joy*, *Trust*, *Fear*, *Surprise*, *Sadness*, and *Disgust*.


**Radar Chart**

The Plutchik radar chart matches the frequency of text terms to the lexicon, and maps out the text's emotive vocabulary (as identified by *nrc*) along the eight Plutchik emotive axes. It does not indicate where the vocabulary appears in the document, only the overall concentration or "shape".


**SA Lines Plot**

The Plutchik SA lines graph maps a document section by section in a linear fashion: similar to how one typically reads a text. Like the radar chart, it also matches the frequency of text terms to the lexicon. But it provides a strong indication of where the emotive vocabulary appears in the document and shows the pattern of usage.  

In the case of a literary work such as a novel, the SA Lines graph provides valuable insight into the document's literary style.  Likewise, the emotive vocabulary patterns revealed also have some relationship to the story arc (plot) and character development.
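
As a rough sketch of the section-by-section counting behind such a lines plot, the base R snippet below cuts a toy text into fixed-size sections and counts lexicon matches in each. The tokens, the `section_size` of 4, and the two-emotion `toy_lexicon` are all hypothetical; the real nrc lexicon covers eight emotions and many more words.

```r
# A toy text, already split into lower-case words.
tokens <- c("joy", "walk", "fear", "fear", "road", "joy",
            "joy", "fear", "home", "joy", "joy", "smile")

# Assign each word to a section of fixed length (assumed size: 4 words).
section_size <- 4
section <- (seq_along(tokens) - 1) %/% section_size + 1

# Hypothetical mini-lexicon: word -> emotion.
toy_lexicon <- c(joy = "joy", smile = "joy", fear = "fear")

# Look up each word; words outside the lexicon become NA and are not counted.
emotion <- unname(toy_lexicon[tokens])

# Emotion counts per section: each row is one point along the plot's X axis.
section_counts <- table(section = section, emotion = emotion)
print(section_counts)
```

Plotting each emotion's column against the section number gives one line per emotion, which is what shows where in the text that vocabulary rises and falls.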


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.



### WP Network



A network graph can reveal the structure of relationships among the word pairs, and not just a simple count of frequency or even tf-idf value.  It potentially reveals connections and linkages that we might otherwise miss if the word pairs were plotted in a bar graph, or listed in a table, or if the text were read in a natural linear fashion. 

Although relational, the *Word Pairs* network graph is also directional: it proceeds from term 1 to term 2, as indicated by the line and arrow.  Indeed, as  Silge & Robinson point out: the *Word Pairs network graph* is also a “visualization of a Markov chain, a common model in text processing. In a Markov chain, each choice of word depends only on the previous word” [{8}](https://www.tidytextmining.com/ngrams.html#visualizing-a-network-of-bigrams-with-ggraph).
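
To illustrate the Markov chain reading, the base R sketch below turns word-pair counts into transition probabilities: given term 1, how likely is each term 2? The toy sentence is hypothetical.

```r
# A toy text, already split into lower-case words.
words <- c("the", "elf", "child", "the", "poor", "child", "the", "elf", "child")

# Directed word pairs: from term 1 to term 2.
from <- head(words, -1)
to <- tail(words, -1)

# Count each directed pair, then convert each row to proportions,
# so that every row sums to 1.
pair_counts <- table(from = from, to = to)
transition <- prop.table(pair_counts, margin = 1)
print(round(transition, 2))
```

In this toy text "the" is followed by "elf" two times out of three, so the arrow from "the" to "elf" carries a probability of about 0.67; each arrow in the network graph can be read the same way.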

In brief, the **Word Pairs network graph** provides us with an overview of key word-pairs that also indicates their relationships within the document (or documents).

(See also [Silge & Robinson](https://www.tidytextmining.com/ngrams.html), “Relationships between words: n-grams and correlations”). 

**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.



### WC Network 



*Word Correlation* analyses discover which terms co-occur with each other within a specified section of a document, and the strength of that co-occurrence as measured by the [correlation](https://en.wikipedia.org/wiki/Correlation_and_dependence) value [{9}](https://www.tidytextmining.com/ngrams.html#summary-3).

A network graph of word correlations displays linkages and associations that are not as easily captured in a bar plot or data table, and are typically missed when reading the text in a natural linear fashion.  Unlike a *Word Pairs* network graph, a *Word Correlations* network graph is not directional: rather, it is cluster-centered depending on the strength of correlation.

In brief, the **Word Correlations network graph** reveals valuable term clusters (empirically determined word associations) which provide information about the deeper layers of language-usage and thought in the document (or documents).

(See also [Silge & Robinson](https://www.tidytextmining.com/ngrams.html), “Relationships between words: n-grams and correlations”). 

**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.





Column {.tabset .tabset-fade}
-------------------------------------

### Commonality Cloud



From the R [wordcloud](https://CRAN.R-project.org/package=wordcloud) package, the **commonality cloud** reveals the terms shared across the documents in the corpus selected for plotting. It shows only the absolute intersection: the terms shared in common by all the documents selected.  For example, if a commonality cloud of six novels is plotted, and a term is present in only five of the six novels, it will NOT be shown.  The relative size of the term shown indicates its total count across all the documents selected.
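
The intersection rule can be sketched in a few lines of base R. The three toy "novels" are hypothetical, and this does not call the wordcloud package: the totals are tallied only to mirror the description above, and no assumption is made about how the package itself weights sizes.

```r
# Three hypothetical documents, already split into lower-case words.
docs <- list(
  novel_a = c("sea", "ship", "whale", "sea"),
  novel_b = c("sea", "letter", "ship"),
  novel_c = c("ship", "island", "sea", "sea")
)

# Terms present in every document: the absolute intersection.
shared_terms <- Reduce(intersect, docs)

# Total count of each shared term across all the documents.
total_counts <- table(unlist(docs))[shared_terms]
print(total_counts)
```

"whale", "letter", and "island" each appear in only one document, so they drop out; only "sea" and "ship" would be eligible for the cloud.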

The **commonality cloud** maps out continuity between documents in terms of their vocabulary.


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.



### Comparison Cloud




From the R [wordcloud](https://CRAN.R-project.org/package=wordcloud) package, the **comparison cloud** reveals the key differences in term frequency (or *tf-idf* value, if chosen instead) between two or more documents. The comparison is relative, not absolute: it does not show every difference in term usage between documents, only the strongest differences in term frequency (or tf-idf).  The same term could appear in all of the documents selected for plotting, but it would be shown for only one document (if at all), and only if it appears markedly more often in that document than in the others.  Likewise, a term that appears often in one document, but does not appear in the other documents, would likely also be displayed for that document. The relative display size indicates the strength of the difference.
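
One way to picture a relative comparison is sketched below in base R. It is an assumption about the general approach, not the package's actual code: each term's rate in a document is compared with its average rate across documents, and the term is assigned to the document where it most exceeds that average. The counts in `term_matrix` are hypothetical.

```r
# Hypothetical term counts: rows are terms, columns are documents.
term_matrix <- matrix(
  c(6, 0, 4,
    0, 4, 6),
  nrow = 3,
  dimnames = list(c("sea", "letter", "the"), c("novel_a", "novel_b"))
)

# Convert counts to rates within each document (each column sums to 1).
rates <- sweep(term_matrix, 2, colSums(term_matrix), "/")

# How far is each rate above or below the term's average rate?
deviation <- rates - rowMeans(rates)

# Assign each term to the document where it is most over-represented,
# and use the size of that gap as its display weight.
assigned <- colnames(deviation)[apply(deviation, 1, which.max)]
names(assigned) <- rownames(deviation)
size <- apply(deviation, 1, max)

print(data.frame(document = assigned, size = size))
```

Note that "the" appears in both documents, yet it is listed under one only, and with the smallest weight, because its rates differ the least.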

The **comparison cloud** maps out key changes (differences) between documents in terms of their vocabulary.


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.


### SA Cloud



The **sentiment analysis cloud** is a specialized type of comparison cloud.  Rather than two or more documents being passed to the function `comparison.cloud()`, the data is first transformed by mapping it to an SA lexicon.  This creates two (or more, depending on the lexicon) data categories.  For example, using the [bing](https://emilhvitfeldt.github.io/textdata/reference/lexicon_bing.html) lexicon will map the terms in a document (or documents) to “positive” and “negative” [{10}](https://www.tidytextmining.com/sentiment.html#wordclouds).  The data is then passed to the function to be contrasted by category, “positive” and “negative” in our example.  The “positive” and “negative” terms are displayed as contrasting sides in the comparison cloud, with the size indicating the relative strength of the difference in term frequency.

The **SA cloud** provides an overview of and insight into emotive language and hence in part the style of a document (or set of documents).


**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.


### Word Cloud 



From the R [wordcloud](https://CRAN.R-project.org/package=wordcloud) package, the **standard word cloud** maps out the top-ranked terms in the document (or documents) selected based on their term frequency, or tf-idf score, or another specified numerical variable used to weight the terms.  Typically, just term frequency is used, and so the terms that occur the most often are displayed.  The higher a term's score, the larger the term appears in the word cloud.  The sizing is scaled: it represents the numerical value proportionally rather than directly.  

The **standard word cloud** provides an overview at a glance, often displaying key topic and content words.




**Source**: CLA 3206A TM Guide. TJ Haslam, [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/), 2020.  In-text citations by URL link.