library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(tinytex)
library(gutenbergr)
library(janeaustenr)
library(RCurl)
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(knitr)
library(wordcloud)
## Loading required package: RColorBrewer
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
@book{silge_robinson_text_mining_2017,
  author    = {Julia Silge and David Robinson},
  title     = {Text Mining with R: A Tidy Approach},
  publisher = {O'Reilly Media, Inc.},
  year      = {2017},
  isbn      = {978-1491981658},
  url       = {https://github.com/dgrtwo/tidy-text-mining}
}
Some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that
I am not having a good day.
is a sad sentence, not a happy one, because of negation. R packages including coreNLP (T. Arnold and Tilton 2016), cleanNLP (T. B. Arnold 2016), and sentimentr (Rinker 2017) are examples of such sentiment analysis algorithms.
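For instance, here is a minimal sketch of sentence-level scoring with sentimentr (this assumes the sentimentr package is installed; it is not loaded above):
library(sentimentr)
# sentiment() scores whole sentences; valence shifters such as "not"
# flip or dampen the polarity of nearby words like "good"
sentiment("I am not having a good day.")
# the score should come out negative despite the positive word "good"
For these algorithms, we may want to tokenize text into sentences, and it makes sense to use a new name for the output column in such a case.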
(p_and_p_sentences <- tibble(text = prideprejudice) %>%
   unnest_tokens(sentence, text, token = "sentences")) # tokenize into one sentence per row, in a column named "sentence"
(p_and_p_sentences$sentence[2])
## [1] "by jane austen"
The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII. One possibility, if this is important, is to try using iconv(), with something like iconv(text, to = "latin1") in a mutate statement before unnesting.
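For example, here is a minimal sketch of that approach on the same Pride and Prejudice text (the to = "latin1" target follows the suggestion above; adjust it for your text):
p_and_p_sentences_latin1 <- tibble(text = prideprejudice) %>%
  mutate(text = iconv(text, to = "latin1")) %>% # convert the encoding before tokenizing
  unnest_tokens(sentence, text, token = "sentences")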
Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
(austen_chapters <- austen_books() %>%                        # pipe austen_books() to group_by()
   group_by(book) %>%                                         # group the output by book
   unnest_tokens(chapter, text, token = "regex",              # split on a regex that matches chapter
                 pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%  # headings; each row holds the full
   ungroup())                                                 # text of one chapter
(austen_chapters %>%
group_by(book) %>%
summarise(chapters = n()))
## `summarise()` ungrouping output (override with `.groups` argument)
(bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative"))
(tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(
linenumber = row_number(),
chapter = cumsum(str_detect(text,
regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text))
(wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n()))
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
For each book, which chapter has the highest proportion of negative words?
(tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup())
## Joining, by = "word"
## `summarise()` regrouping output by 'book' (override with `.groups` argument)
Import the Bing sentiment lexicon to use as a lookup.
lookup_bing <- get_sentiments("bing")
Import the CSV file of airline review tweets from Kaggle (https://www.kaggle.com/crowdflower/twitter-airline-sentiment).
# filename <- getURL("https://raw.githubusercontent.com/audiorunner13/Masters-Coursework/main/DATA607%20Spring%202021/Week10/Data/Tweets.csv")
# airline_tweets_src <- read.csv(text = filename,na.strings = "")
filename <- "/Users/Audiorunner13/CUNY MSDS Course Work/DATA607 Spring 2021/Week10/archive/Tweets.csv"
airline_tweets_src <- read.csv(filename)
airline_tweets_src$text <- tolower(airline_tweets_src$text) %>% # lowercase the tweet text
  str_replace("^@[a-z]* ", "")                                  # and strip the leading @airline mention
head(airline_tweets <- airline_tweets_src %>% select(airline, text, airline_sentiment), 10)
What are the most common sentiment words by airline?

1. Take the text of the tweets and convert it to the tidy format using unnest_tokens().
2. Set up a review-number column to keep track of which tweet of each airline a word comes from.
3. Use group_by() and mutate() to construct that column.
(tidy_airline_reviews <- airline_tweets %>%
group_by(airline) %>%
mutate(
review = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text))
Next, let’s filter() the data frame for the Virgin America tweets and then use inner_join() with the Bing lexicon to perform the sentiment analysis. What are the most common sentiment words in Virgin America tweets? Let’s use count() from dplyr.
(tidy_airline_reviews %>%                  # pipe tidy_airline_reviews to filter()
   filter(airline == "Virgin America") %>% # keep only the Virgin America tweets
   inner_join(lookup_bing) %>%             # inner_join() on the Bing lexicon
   count(word, sort = TRUE))               # count each sentiment word, sorted descending
## Joining, by = "word"
Count up how many positive and negative words there are for each airline. Further below, we use pivot_wider() so that we have negative and positive sentiment in separate columns, and calculate a net sentiment (positive - negative).
(twitter_airline_sentiment <- tidy_airline_reviews %>%
inner_join(lookup_bing) %>%
count(airline, sentiment))
## Joining, by = "word"
Multiply the negative counts by -1 for use with ggplot2.
# flip the sign of the negative counts so they plot below zero
twitter_airline_sentiment <- twitter_airline_sentiment %>%
  mutate(n = if_else(sentiment == "negative", -n, n))
twitter_airline_sentiment
Rename the columns for use with ggplot2.
twitter_airline_sentiment <- twitter_airline_sentiment %>% rename(Airline = airline, Sentiment = sentiment, Count = n)
Plot the negative and positive counts by airline using ggplot2.
ggplot(twitter_airline_sentiment, aes(x = Airline, y = Count)) +
  geom_col(position = position_stack(), color = "white", fill = "blue") +
  labs(title = "Airline Sentiment Analysis") +
  theme_minimal() +
  coord_flip()
As you can see from the plot, of the six major airlines, United has the most negative reviews, and US Airways has almost twice as many negative reviews as positive ones. Southwest, Delta, and Virgin America have more positive reviews than negative; however, Virgin America has very few reviews compared to the other airlines.
(twitter_airline_sentiment %>%
pivot_wider(names_from = Sentiment, values_from = Count, values_fill = 0) %>%
mutate(Sentiment = positive + negative) %>%
rename(Negative = negative, Positive = positive))
(airline_tweets %>%
group_by(airline) %>%
summarise(texts = n()) %>%
ungroup())
## `summarise()` ungrouping output (override with `.groups` argument)
Summary: It appears that airlines get more negative tweets than positive, and I struggle to understand why. I have flown quite a bit for work and for pleasure, and it is rare that I have a negative experience. I truly enjoy flying, and I often feel for flight attendants as they try their best to accommodate 100+ passengers on a typical flight. I wish people would be a little more appreciative of the convenience of flying as opposed to having to drive or sail to their destinations.
I really enjoyed sentiment analysis despite the issues I had with my file not importing from GitHub the way it does from my local drive, and that I was not successful in referencing or citing the book in the first part of this exercise.