library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(tinytex)
library(gutenbergr)
library(janeaustenr)
library(RCurl)
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(knitr)
library(wordcloud)
## Loading required package: RColorBrewer
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
@book{silge_robinson_text_mining_2017,
  author    = {Julia Silge and David Robinson},
  title     = {Text Mining with R: A Tidy Approach},
  publisher = {O'Reilly Media, Inc.},
  year      = {2017},
  isbn      = {978-1491981658},
  url       = {https://github.com/dgrtwo/tidy-text-mining}
}
Some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a sad sentence, not a happy one, because of negation. R packages including coreNLP (T. Arnold and Tilton 2016), cleanNLP (T. B. Arnold 2016), and sentimentr (Rinker 2017) are examples of such sentiment analysis algorithms.
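For instance, here is a minimal sketch of sentence-level scoring with sentimentr (this assumes the sentimentr package is installed; it is not loaded above):
library(sentimentr)
# sentiment() scores whole sentences; valence shifters such as "not"
# flip or dampen the polarity of nearby words like "good"
sentiment("I am not having a good day.")
# the score should come out negative despite the positive word "good"
For these algorithms, we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.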
(p_and_p_sentences <- tibble(text = prideprejudice) %>%
   unnest_tokens(sentence, text, token = "sentences")) # tokenize into one sentence per row, in a column named "sentence"
(p_and_p_sentences$sentence[2])
## [1] "by jane austen"
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII. One possibility, if this is important, is to try using iconv(), with something like iconv(text, to = "latin1") in a mutate statement before unnesting.
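For example, here is a minimal sketch of that approach on the same Pride and Prejudice text (the to = "latin1" target follows the suggestion above; adjust it for your text):
p_and_p_sentences_latin1 <- tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>% # convert the encoding before tokenizing
  unnest_tokens(sentence, text, token = "sentences")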
Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
(austen_chapters <- austen_books() %>%                        # pipe austen_books() to group_by()
   group_by(book) %>%                                         # group the output by book
   unnest_tokens(chapter, text, token = "regex",              # split on a regex that matches chapter
                 pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%  # headings; each row holds the full
   ungroup())                                                 # text of one chapter
(austen_chapters %>%
group_by(book) %>%
summarise(chapters = n()))
## `summarise()` ungrouping output (override with `.groups` argument)
(bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative"))
(tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text))
(wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n()))
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
For each book, which chapter has the highest proportion of negative words?
(tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup())
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
Import the Bing sentiment lexicon to use as a lookup.
lookup_bing <- get_sentiments("bing")
Import the CSV file of airline review tweets from Kaggle (https://www.kaggle.com/crowdflower/twitter-airline-sentiment).
# filename <- getURL("https://raw.githubusercontent.com/audiorunner13/Masters-Coursework/main/DATA607%20Spring%202021/Week10/Data/Tweets.csv")
# airline_tweets_src <- read.csv(text = filename,na.strings = "")
filename <- "/Users/Audiorunner13/CUNY MSDS Course Work/DATA607 Spring 2021/Week10/archive/Tweets.csv"
airline_tweets_src <- read.csv(filename)
airline_tweets_src$text <- tolower(airline_tweets_src$text) %>% # lowercase the tweet text
  str_replace("^@[a-z]* ", "")                                  # and strip the leading @airline mention
head(airline_tweets <- airline_tweets_src %>% select(airline, text, airline_sentiment), 10)
What are the most common sentiment words by airline?

1. Take the text of the tweets and convert it to the tidy format using unnest_tokens().
2. Set up a review-number column to keep track of which tweet of each airline a word comes from.
3. Use group_by() and mutate() to construct that column.
(tidy_airline_reviews <- airline_tweets %>%
group_by(airline) %>%
mutate(
review = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text))
Next, let’s filter() the data frame for the Virgin America tweets and then use inner_join() with the Bing lexicon to perform the sentiment analysis. What are the most common sentiment words in Virgin America tweets? Let’s use count() from dplyr.
(tidy_airline_reviews %>%                  # pipe tidy_airline_reviews to filter()
   filter(airline == "Virgin America") %>% # keep only the Virgin America tweets
   inner_join(lookup_bing) %>%             # inner_join() on the Bing lexicon
   count(word, sort = TRUE))               # count each sentiment word, sorted descending
## Joining, by = "word"
Count up how many positive and negative words there are for each airline. Further below, we use pivot_wider() so that we have negative and positive sentiment in separate columns, and calculate a net sentiment (positive - negative).
(twitter_airline_sentiment <- tidy_airline_reviews %>%
inner_join(lookup_bing) %>%
count(airline, sentiment))
## Joining, by = "word"
Multiply the negative counts by -1 for use with ggplot2.
# flip the sign of the negative counts so they plot below zero
twitter_airline_sentiment <- twitter_airline_sentiment %>%
  mutate(n = if_else(sentiment == "negative", -n, n))
twitter_airline_sentiment
Rename the columns for use with ggplot2.
twitter_airline_sentiment <- twitter_airline_sentiment %>% rename(Airline = airline, Sentiment = sentiment, Count = n)
Plot the negative and positive counts by airline using ggplot2.
ggplot(twitter_airline_sentiment, aes(x = Airline, y = Count)) +
  geom_col(position = position_stack(), color = "white", fill = "blue") +
  labs(title = "Airline Sentiment Analysis") +
  theme_minimal() +
  coord_flip()
As you can see from the plot, of the six major airlines, United has the most negative reviews, and US Airways has almost twice as many negative reviews as positive ones. Southwest, Delta, and Virgin America have more positive reviews than negative; however, Virgin America has very few reviews compared to the other airlines.
(twitter_airline_sentiment %>%
pivot_wider(names_from = Sentiment, values_from = Count, values_fill = 0) %>%
mutate(Sentiment = positive + negative) %>%
rename(Negative = negative, Positive = positive))
(airline_tweets %>%
group_by(airline) %>%
summarise(texts = n()) %>%
ungroup())
## `summarise()` ungrouping output (override with `.groups` argument)
Summary: It appears that airlines get more negative tweets than positive, and I struggle to understand why. I have flown quite a bit for work and for pleasure, and it is rare that I have a negative experience. I truly enjoy flying, and I often feel for flight attendants as they try their best to accommodate 100+ passengers on a typical flight. I wish people would be a little more appreciative of the convenience of flying as opposed to having to drive or sail to their destinations.
I really enjoyed sentiment analysis despite the issues I had with my file not importing from GitHub the way it does from my local drive, and that I was not successful in referencing or citing the book in the first part of this exercise.