Introduction

Throughout CPP 527, we have worked with a wide range of data science tools available in R. After learning several methods for manipulating and analyzing text as data, including regular expressions and the quanteda package, I decided to explore additional R packages that are useful for text and sentiment analysis.


Content Overview

This code-through covers several text and sentiment analysis options in the quanteda and tidytext packages. We will use President Barack Obama's 2014 State of the Union address to identify the most frequently used words and the sentiments those words convey.

Ensure that the following packages are installed:

library( quanteda )
## Package version: 2.1.2
## Parallel computing: 2 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View
library( readtext )
library( dplyr )
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library( stringr )
library( tidytext )
library( tidyverse )
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ readr   1.3.1
## ✓ tibble  3.0.3     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library( textdata )
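
If any of these packages fail to load, they can be installed in one step:

install.packages( c( "quanteda", "readtext", "dplyr", "stringr", "tidytext", "tidyverse", "textdata" ) )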


Load the .txt file containing the 2014 State of the Union address.

data <- readtext("https://raw.githubusercontent.com/datameister66/data/master/sou2014.txt")
data

Quanteda Package

To begin, create a corpus out of the State of the Union document.

sou_corpus <- corpus( data )
summary( sou_corpus )

We can use tokens() to clean up the data.

sou_corpus_edit <- tokens( sou_corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE )

From there, construct a document-feature matrix (dfm).

We can then use the textplot_wordcloud() function to create a word cloud of the most frequent words in the speech.

sou_dfm <- dfm( sou_corpus_edit, remove = stopwords("english"), remove_punct = TRUE )

textplot_wordcloud( sou_dfm, min_count=5, color = RColorBrewer::brewer.pal( 8, "Blues" ))

While this word cloud is visually appealing, it does not show exactly how often each word appears in the speech. We can use ggplot() to create a frequency chart of the top words.

sou_features <- topfeatures( sou_dfm, 10 )

top_features <- data.frame(
  term = names( sou_features ),
  frequency = unname( sou_features )
)

top_features$term <- with( top_features, reorder( term, -frequency ))

ggplot( top_features ) +
  geom_point( aes( x = term, y = frequency )) +
  theme_replace() +
  labs( x = "Word", y = "Frequency" )


Tidytext Package

Another key skill that CPP 527 covered is working with tidy data. The tidytext package not only makes cleaning text data easier, but it also adds another layer of text analysis: sentiment analysis.

Begin by tidying the 2014 State of the Union address. unnest_tokens() splits the text into one word per row.

tidy_sou <- data %>%
  unnest_tokens( word, text )

tidy_sou
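
Before moving on to sentiment, here is a quick optional check, a minimal sketch using tidytext's built-in stop_words table (not used elsewhere in this code-through): drop common stop words with anti_join() and count the remaining words.

tidy_sou %>%
  anti_join( stop_words, by = "word" ) %>%   # remove "the", "and", etc.
  count( word, sort = TRUE )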

From there, we can begin to perform sentiment analysis on our words.

tidytext offers three different lexicon options for classifying sentiment:

  • AFINN
  • bing
  • nrc

AFINN rates each word on a scale from -5 (most negative) to +5 (most positive).

bing classifies each word as either positive or negative.

nrc uses a broader set of sentiment categories: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust.
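
The examples below use bing and nrc, but as a minimal sketch of the AFINN approach (note that textdata prompts for a one-time download of the lexicon), we can join the scores onto our tidy words and sum them into a rough overall score:

afinn <- get_sentiments( "afinn" )   # columns: word, value

tidy_sou %>%
  inner_join( afinn, by = "word" ) %>%
  summarize( net_sentiment = sum( value ) )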


Use bing to identify the positive words and count how many times they appear in the speech:

positive <- get_sentiments( "bing" ) %>%
  filter( sentiment == "positive" )

tidy_sou %>%
  semi_join( positive ) %>%
  count( word, sort = TRUE )
## Joining, by = "word"

Now, identify and count the negative words:

negative <- get_sentiments( "bing" ) %>%
  filter( sentiment == "negative" )

tidy_sou %>%
  semi_join( negative ) %>%
  count( word, sort = TRUE )
## Joining, by = "word"

Use ggplot() to visualize the frequency of positive words compared to negative words.

bing <- get_sentiments( "bing" )

bing_word_counts <- tidy_sou %>%
  inner_join( bing ) %>%
  count( word, sentiment, sort = TRUE )
## Joining, by = "word"
bing_word_counts
bing_word_counts %>%
  filter( n > 2 ) %>%
  mutate( n = ifelse( sentiment == "negative", -n, n )) %>%
  mutate( word = reorder( word, n )) %>%
  ggplot( aes( word, n, fill = sentiment )) +
  geom_col() +
  coord_flip() +
  labs( x = "Word", y = "Contribution to sentiment" ) +
  scale_fill_brewer( palette="Set1" )
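
To gauge the overall balance of the speech, a quick sketch building on the bing_word_counts object above totals the positive and negative counts:

bing_word_counts %>%
  group_by( sentiment ) %>%
  summarize( total = sum( n ) )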


Take a look at how nrc classifies the words into more specific sentiments.

Trust:

trust <- get_sentiments( "nrc" ) %>%
  filter( sentiment == "trust" )

tidy_sou %>%
  semi_join( trust ) %>%
  count( word, sort = TRUE )
## Joining, by = "word"

Anticipation:

anticipation <- get_sentiments( "nrc" ) %>%
  filter( sentiment == "anticipation" )

tidy_sou %>%
  semi_join( anticipation ) %>%
  count( word, sort = TRUE )
## Joining, by = "word"

Disgust:

disgust <- get_sentiments( "nrc" ) %>%
  filter( sentiment == "disgust" )

tidy_sou %>%
  semi_join( disgust ) %>%
  count( word, sort = TRUE )
## Joining, by = "word"


We can use ggplot() to create a frequency chart showing the nrc sentiments as well:

nrc <- get_sentiments( "nrc" )

nrc_word_counts <- tidy_sou %>%
  inner_join( nrc ) %>%
  count( word, sentiment, sort = TRUE )
## Joining, by = "word"
nrc_word_counts
nrc_word_counts %>%
  filter( n > 5 ) %>%
  mutate( word = reorder( word, n )) %>%
  ggplot( aes( word, n, fill = sentiment )) +
  geom_col() +
  coord_flip() +
  labs( x = "Word", y = "Contribution to sentiment" )


Works Cited

This code references and cites the following sources: