Update Everything!

Packages

#install the Rling package from source - download the package file from online first
install.packages(file.choose(), repos = NULL, type = "source")
## Installing package into 'C:/Users/Emily/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)
#deal with modeest - it needs the Bioconductor package genefilter
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("genefilter")
## Bioconductor version 3.9 (BiocManager 1.30.4), R 3.6.1 (2019-07-05)
## Installing package(s) 'genefilter'
## package 'genefilter' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Emily\AppData\Local\Temp\RtmpCiJFOH\downloaded_packages
## installation path not writeable, unable to update packages: boot
## Update old packages: 'curl'
#download from CRAN - set a mirror so this also works when knitting;
#textdata is included so the afinn/nrc lexicons can be retrieved later
install.packages(c("knitr", "modeest", "car", "rms", "visreg", 
                   "googleVis", "party", "pvclust", "LSAfun",
                   "ngram", "tm", "slam", "tidytext", "textdata",
                   "topicmodels", "tidyverse", "fields", "rgl",
                   "rworldmap", "psych", "ca", "FactoMineR",
                   "janeaustenr", "dplyr", "stringr", "tidyr",
                   "ggplot2", "wordcloud", "reshape2"),
                 repos = "https://cloud.r-project.org")
## Installing packages into 'C:/Users/Emily/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

How this course works

Generally, lectures will be formatted with:

Let’s jump in!

Data for this course will come in many forms, as language is inherently unstructured. We will mostly use tidy format as defined: each variable is a column, each observation is a row, and each type of observational unit is a table.

However, we might define an observation as a frequency or proportion of word occurrences, or as a specific word (rather than a participant or person in a study), etc. Learning how to structure the data for our analyses will be part of the goal of each lecture.

Tokenization

Tokens are meaningful units of text - often words, but they can also be phrases, sentences, documents, etc. Thus, to keep our data in tidy format, we use one token per row, treating each token as an observation. Later in the semester, we will use term-by-document matrices and corpus objects, which are formatted differently.
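
To see what one-token-per-row looks like, here is a minimal sketch (the two lines of text are just an illustration):

library(dplyr)
library(tidytext)

#a tiny data frame with one row per line of text
text_df <- tibble(line = 1:2,
                  text = c("It is a truth universally acknowledged,",
                           "that a single man in possession of a good fortune"))

#unnest_tokens lowercases, strips punctuation, and returns one word per row
text_df %>%
  unnest_tokens(word, text)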

Tidytext Package

The tidytext package has many tools that we can use to help us analyze text information. Let’s load it and try sentiment analysis with the package.

library(tidytext)

Sentiment Analysis

When we read, we use our understanding of words to help determine meaning. Often a word's meaning includes its emotional intent. Using information about the valence of words, we can determine whether a text is positive or negative (or fits other emotional descriptors).

Run the code below to see the graphic. Make sure you’ve downloaded the picture and put it in the same folder as this assignment.

The graphic below shows how you might treat a research workflow using tidyverse to analyze sentiment.

knitr::include_graphics("tidyflow-ch-2.png")

Sentiment Lexicon

We are going to examine sentiment as a “sum of parts” - this approach means that we can sum up the sentiments of individual words to represent the larger text.

head(sentiments)
## # A tibble: 6 x 2
##   word       sentiment
##   <chr>      <chr>    
## 1 2-faces    negative 
## 2 abnormal   negative 
## 3 abolish    negative 
## 4 abominable negative 
## 5 abominably negative 
## 6 abominate  negative

Through tidytext we have access to three sentiment lexicons: AFINN, bing, and nrc (the sentiments dataset printed above contains the bing lexicon). There are several others we could use, including one by Warriner et al., but these three provide good coverage of common English words.

The lexicons differ in how they code sentiment: AFINN scores each word on an integer scale from -5 (most negative) to 5 (most positive); bing labels each word as simply positive or negative; and nrc tags words as positive or negative as well as with emotion categories such as joy, anger, fear, sadness, and trust.
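
Each lexicon can be pulled on its own with get_sentiments(); a minimal sketch (the afinn and nrc lexicons are downloaded through the textdata package the first time you request them):

get_sentiments("bing")  #word plus a positive/negative label
get_sentiments("afinn") #word plus an integer value from -5 to 5
get_sentiments("nrc")   #word plus emotion and polarity categories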

Important Considerations

The limitation to these datasets is that we have to remember when and how they were validated. One thing we will discuss this semester is the fact that word meanings change over time, so we have to consider the time period for each analysis.

Another limitation to this approach is that context is ignored (this approach is sometimes called “bag-of-words” because words are just tossed into a bag and totaled up). Qualifiers like “no” and “aren’t” are not considered - additionally, sarcasm and idioms will not be captured.
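
A minimal sketch of the problem (the sentence is invented for illustration): once tokenized, the negation disappears and only “good” is matched.

library(dplyr)
library(tidytext)

#"not good" is clearly negative, but tokenizing drops the "not ... good" pairing
tibble(text = "the soup was not good at all") %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word")
#only "good" survives the join, and it is counted as positive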

Example Analysis of Jane Austen

For this analysis, we are going to explore Jane Austen novels. You will want to change the parameters of the analysis while exploring the functionality of the code. You should fill in the information where requested - look for instructions in ALL CAPS.

#load the libraries
library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

#specific to this package, pull jane austen books and create a tidy dataframe
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

Specifically, check out the unnest_tokens function here - word is the output column, while text is the input column.

WHAT DOES IT APPEAR THAT THE UNNEST_TOKENS FUNCTION DID? TRY RUNNING THE CODE WITH AND WITHOUT THE LAST LINE. Answer: It split the text column into tokens using the tokenizers package, restructuring the table into one-token-per-row. (The function supports non-standard evaluation through the tidyeval framework.) The result is a tidy format of the text: one token per row, with each token treated as an observation.

Joining together

The code provided analyzes the “joy” sentiment in “Emma”.

nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>% #this function merges sentiment with the Emma data
  count(word, sort = TRUE) #makes a frequency table 
## Joining, by = "word"

EDIT THE CODE TO USE A DIFFERENT EMOTION AND NOVEL.
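
For example, a sketch swapping in the “trust” emotion and Mansfield Park (any nrc emotion and any title in austen_books() will work):

#same pattern as above, different emotion and novel
nrc_trust <- get_sentiments("nrc") %>% 
  filter(sentiment == "trust")

tidy_books %>%
  filter(book == "Mansfield Park") %>%
  inner_join(nrc_trust, by = "word") %>%
  count(word, sort = TRUE)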

WHAT ARE THE TOP WORDS IN YOUR EMOTION AND NOVEL?

DO THERE APPEAR TO BE SOME WORDS THAT ARE SURPRISING TO YOU? (I.E. THEY DO NOT SEEM TO MATCH WHAT YOU MIGHT EXPECT TO FIND AS FREQUENT FOR THAT EMOTION) Answer: the original run errored because that version of tidytext could not fetch the nrc lexicon; updating tidytext and installing the textdata package (added to the setup above) lets the code run.

Text Size

We should consider the size of the text chunks we analyze for sentiment. If we use a whole document, the effects of individual sections (like one sad chapter) may get washed out. However, units as small as single sentences may miss the larger structure of the text. The suggestion from the book is to use ~80 lines of text, and its author is pretty smart, so let’s try that.
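
The chunking below relies on %/%, R’s integer division operator, which maps line numbers onto 80-line bins:

#lines 1-79 land in index 0, lines 80-159 in index 1, and so on
c(1, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2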

library(tidyr)

jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"

WHAT DID THIS CODE APPEAR TO CREATE FOR US? Answer: This code chunks the text into a reasonable size to analyze for sentiment, using about 80 lines at a time to avoid missing the larger structure of the text. Every 80 lines in each book are aggregated into an index. The words in each index are counted as negative or positive, and each index’s overall sentiment is calculated as positive minus negative.

Plotting Sentiment

We can use ggplot2 to plot the sentiment across the predefined chunks of text. This plot is similar to a lexical dispersion plot, which shows the instances of a word across a text.

library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x") +
  theme_bw()

EXAMINE THE GRAPH - WHAT BOOK APPEARS TO HAVE THE MOST POSITIVE INTERPRETATION? THE MOST NEGATIVE? Answer: Persuasion appears to have the most negative sections.

CHANGE THE NUMBER OF LINES TO SOMETHING SMALLER LIKE 10-20 OR MUCH LARGER LIKE 200 - RERUN THE CODE AND GRAPH. WHAT CHANGES DO YOU SEE? Answer: With 20-line chunks, there are many more index values, and it is hard to tell which book is the most negative or most positive. With 200-line chunks, the number of index values decreases and each sentiment value grows larger, since more lines are summed per chunk.

Picking a lexicon

The choice of lexicon might be based on word overlap (i.e., it has the words you need) or on what you want to analyze. Because we have more than one, we can compare them directly.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

head(pride_prejudice)
## # A tibble: 6 x 4
##   book              linenumber chapter word     
##   <fct>                  <int>   <int> <chr>    
## 1 Pride & Prejudice          1       0 pride    
## 2 Pride & Prejudice          1       0 and      
## 3 Pride & Prejudice          1       0 prejudice
## 4 Pride & Prejudice          3       0 by       
## 5 Pride & Prejudice          3       0 jane     
## 6 Pride & Prejudice          3       0 austen

THIS EXAMPLE IS FOR PRIDE & PREJUDICE. CHANGE THE CODE HERE TO USE A DIFFERENT BOOK.

This code pulls each sentiment and merges it with the book chosen above. The plot at the end compares each of the methods. Here it is most appropriate to use the positive and negative categories from NRC to match the bing and AFINN datasets. Otherwise we might not be comparing the same ideas.

afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(value)) %>% #the AFINN score column is named value
  mutate(method = "AFINN")
## Joining, by = "word"
bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>%
                            mutate(method = "Bing et al."),
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", 
                                                                 "negative"))) %>%
                            mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
bind_rows(afinn, 
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

EXAMINING YOUR BOOK - DO THE THREE SOURCES APPEAR TO AGREE? WHAT ARE THE MAJOR DIFFERENCES OR SIMILARITIES? Answer: the original run errored here as well, because the afinn and nrc lexicons require the textdata package; with tidytext updated and textdata installed, the three methods can be compared.

Find polarity

Let’s figure out the most common positive and negative words across all of Jane Austen’s texts.

bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
head(bing_word_counts)
## # A tibble: 6 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639

LOOK AT THE TOP WORDS HERE. WHY MIGHT A FEW OF THESE BE PROBLEMATIC/MISINTERPRETED? THINK ABOUT THE STYLE OF WRITING FOR THESE NOVELS. Answer: the word “miss” appears the most and is coded as negative, but in these novels it is almost always the honorific for young women (e.g., “Miss Bennet”), not the verb “to miss” - the lexicon misinterprets it.

Let’s plot that analysis for easier viewing:

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip() + 
  theme_bw()
## Selecting by n
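
Aside: the “miss” issue flagged above is common enough that one fix - a sketch, not part of the assignment - is to treat it as a custom stop word and remove it before counting:

#add "miss" to a custom stop word list, then filter it out of the counts
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)

bing_word_counts %>%
  anti_join(custom_stop_words, by = "word") %>%
  head()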

CHANGE THE CODE ABOVE TO REFLECT ONLY ONE OF THE NOVELS IN THE DATASET. WHAT DO YOU OBSERVE ABOUT THE MOST USED POSITIVE AND NEGATIVE WORDS? Answer: The pattern for both is similar, but there are many more negative words than positive words, since this novel is not a happy story.

More visualization

Word clouds are a popular visualization tool for text analysis. We can use the wordcloud library to help us create those plots. This analysis ignores stop words, which are common words that appear frequently, like “the”, “an”, “of”, and “into”.
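
The stop word list used below ships with tidytext as the stop_words data frame (a word column plus the lexicon each word came from); you can peek at it with:

head(stop_words)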

library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>% 
  count(word) %>%
  with(wordcloud(word, n, max.words = 50))
## Joining, by = "word"

It might be more interesting though to compare positive versus negative in the same plot:

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
    filter(book == "Pride & Prejudice")%>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 50)
## Joining, by = "word"

EDIT THE ABOVE CODE TO ONLY INCLUDE ONE OF THE BOOKS. WHAT DO YOU FIND TO BE THE MOST POSITIVE AND NEGATIVE WORDS IN YOUR BOOK (SHOULD MATCH ABOVE)? Answer: I used Pride and Prejudice as a single book. The most positive words are good, well, and great, and the most negative word is miss. Interestingly, in this novel positive words outnumber negative words, perhaps because Pride and Prejudice has a happy ending.

The end

You now have the skills to explore a set of text for positive and negative sentiment! You can apply these ideas to many types of text. In a future session, we will explore Twitter word usage and sentiment.

To turn in this assignment, hit KNIT at the top. You will submit the report in html/pdf/word format (default is html) on Moodle for credit. Be sure you have answered the questions. Great job!