The nrc lexicon has been loaded for you. Include code to group the data by sentiment and visualize it with a bar graph. Write a few sentences to describe the graph to a non-data scientist.
# Count how many words the NRC lexicon assigns to each of its ten sentiment
# categories (the tidytext and dplyr packages are assumed to be loaded).
sent_df <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative", "anger", "anticipation", "joy",
                          "disgust", "fear", "sadness", "surprise", "trust")) %>%
  count(sentiment)
sent_df
## # A tibble: 10 x 2
## sentiment n
## * <chr> <int>
## 1 anger 1246
## 2 anticipation 837
## 3 disgust 1056
## 4 fear 1474
## 5 joy 687
## 6 negative 3318
## 7 positive 2308
## 8 sadness 1187
## 9 surprise 532
## 10 trust 1230
Sentiment analysis, also known as opinion mining, is a type of data mining that measures the inclination of people's opinions through natural language processing (NLP), computational linguistics, and text analysis. These processes help us extract and analyze subjective information from text such as that found on the Web. More specifically, a sentiment is a category of feeling or opinion into which the words or sentences in a text can fall. There are three main sentiment lexicons (a lexicon is the vocabulary of a person, language, or branch of knowledge): AFINN, bing, and NRC. Each of these splits words into different sentiment categories. The one of interest for this analysis is the NRC lexicon, which splits the sentiments into ten categories: anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, and positive. Before the book data was loaded in, we wanted to see the number of words found in each sentiment of this lexicon, particularly the positive and negative ones. This will help us see whether the positive and negative word counts in the NRC lexicon match up well with the words in the janeaustenr data set that will be used throughout this analysis. Knowing this information helps us choose an appropriate lexicon for the data set, although the prompt already states that the NRC lexicon will be used for this analysis. We see from this table that there are ten separate sentiments, with the corresponding number of words in each sentiment to the right.
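The prompt also asks for a bar graph of these counts. One possible way to produce it from sent_df is sketched below (this assumes ggplot2 is loaded along with the packages above; the resulting plot is not shown here).
# Sketch: bar graph of how many words the NRC lexicon assigns to each sentiment.
sent_df %>%
  ggplot(aes(x = reorder(sentiment, n), y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Sentiment", y = "Number of words in the NRC lexicon")
For a non-data scientist: each bar stands for one sentiment category, and the longer the bar, the more words the NRC lexicon assigns to that category. Based on the table above, the negative bar would be the longest (3,318 words) and the surprise bar the shortest (532 words).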
# Convert the Jane Austen novels (from the janeaustenr package) into a tidy,
# one-word-per-row format; stringr and tidytext are assumed to be loaded.
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    linenumber = row_number(),            # line of the book each word comes from
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))) %>%  # running chapter count
  ungroup() %>%
  unnest_tokens(word, text)               # split each line of text into single words
This code was used to convert the austen_books() text into a tidy format using the unnest_tokens() function. The group_by() function was used to separate the books in the Austen data. This function, along with the mutate() function, was used to set up additional columns that keep track of which line and chapter of the book each specific word comes from.
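To show what unnest_tokens() does on its own, here is a small sketch on a made-up two-line snippet (the example_text tibble is hypothetical and used only for illustration); it splits each line into one lowercase word per row while keeping the other columns.
# Hypothetical two-line example of unnest_tokens().
example_text <- tibble(line = 1:2,
                       text = c("It is a truth universally acknowledged",
                                "that a single man in possession of a good fortune"))
example_text %>%
  unnest_tokens(word, text)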
# Similar to the first chunk, but keeping only the positive and negative
# sentiments and leaving out the count() step.
sent_df <- get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative"))
tidy_books %>%
  inner_join(sent_df) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 2,682 x 2
## word n
## <chr> <int>
## 1 good 1380
## 2 mother 1082
## 3 dear 822
## 4 sir 806
## 5 immediately 646
## 6 hope 601
## 7 friend 593
## 8 happy 534
## 9 love 495
## 10 feeling 494
## # … with 2,672 more rows
The data chosen for this analysis covers all of the Jane Austen novels. Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let's use the NRC lexicon and the filter() function to keep only the positive and negative words. Next, we use inner_join() to match those sentiment words against the words from the novels, which performs the sentiment analysis, and then count() to see which of the positive and negative words are the most common. On the left side of the table are the words found in the novels, and the right side shows the number of times each word appears in the text.
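As a small side sketch, the same join can also be grouped by novel to compare how many positive and negative words each book contains (this is not required by the prompt; the by = "word" argument simply suppresses the join message).
# Sketch: positive vs negative word counts for each Jane Austen novel.
tidy_books %>%
  inner_join(sent_df, by = "word") %>%
  count(book, sentiment)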
# Count how often each word occurs under each NRC sentiment.
nrc_word_counts <- tidy_books %>%
  inner_join(get_sentiments("nrc")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
nrc_word_counts
## # A tibble: 6,830 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 good anticipation 1380
## 2 good joy 1380
## 3 good positive 1380
## 4 good surprise 1380
## 5 good trust 1380
## 6 time anticipation 1337
## 7 thought anticipation 839
## 8 dear positive 822
## 9 sir positive 806
## 10 sir trust 806
## # … with 6,820 more rows
One advantage of having a data frame with both sentiment and word is that we can analyze the word counts that contribute to each sentiment. By calling count() with both word and sentiment as arguments, we find out how much each word contributes to each sentiment. This data table lists the word, the sentiment it is found under, and the number of times the word is used. We see here that the word “good” is found in the anticipation, joy, positive, surprise, and trust sentiment categories. We would expect plenty of overlap for this word, since it can be used in many different contexts. In each of these sentiment categories, the word “good” appears 1,380 times.
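One quick way to confirm this overlap directly in the lexicon (a small sketch using the same get_sentiments() call as above) is to filter the NRC table down to the single word “good”.
# Sketch: list every NRC sentiment that the word "good" is assigned to.
get_sentiments("nrc") %>%
  filter(word == "good")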
# Top ten words for each NRC sentiment, shown as one faceted bar chart per
# sentiment (ggplot2 is assumed to be loaded).
nrc_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)
Bar charts are used to compare the frequency, total count, sum, or average of data in different categories by using horizontal or vertical bars. Bar charts help us visualize and compare the various sentiment categories. The nrc_word_counts data frame was created in the previous chunk from the tidy_books data set. After the words were counted, separated into their own categories, and sorted from most commonly occurring to least, the code directly above was used to create the ten bar charts containing the top ten most common words for each sentiment. These ten words make up the y-axis, while the contribution to the sentiment, that is, the number of times each word appears in the text, is represented on the x-axis. For example, the ten most common words (from most common to least common) that appear in the “anger” sentiment category are ill, feeling, words, bad, bear, spite, evil, distress, money, and loss. Some of these words also appear in other categories such as sadness, fear, disgust, and negative. Looking at the table in the previous section, we can see that in the joy, positive, trust, surprise, and anticipation categories the word “good” is represented by the top bar, appearing 1,380 times and making it the most commonly occurring word in each of those categories. However, as we will see in the next code chunk, the word “good” can be treated as a stop word. Stop words are words that are so common in the language we are using (English in this case) that they do not add context or help us infer anything further from the text. Since the word “good” appears so many times, in many different sentiments, does not offer much context, and is used so often in the English language, we will consider it a stop word and remove it from the analysis, as sketched below.
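The next chunk removes stop words with the stop_words data set that ships with tidytext. If “good” were not already covered by that list, one way to drop it would be to append it as a custom stop word before the anti_join(); this is only a sketch, and the my_stop_words name is made up for illustration.
# Sketch: add "good" as a custom stop word and remove it together with the
# standard English stop words.
my_stop_words <- bind_rows(stop_words,
                           tibble(word = "good", lexicon = "custom"))
nrc_word_counts %>%
  anti_join(my_stop_words, by = "word")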
Discuss the word cloud thoroughly for someone who is not a data scientist.
# Build a word cloud of the 100 most frequent sentiment words after removing
# standard English stop words (the wordcloud package is assumed to be loaded).
word_counts <- tidy_books %>%
  inner_join(get_sentiments("nrc")) %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, scale = c(3.5, 0.25), max.words = 100))
## Joining, by = "word"
## Joining, by = "word"
A word cloud is a pictorial representation of a collection of words that highlights the most frequently used or repeated words. In this type of representation, the words that appear most frequently are larger and bolder. On the other hand, words that do not appear as frequently are smaller and lighter in color. The location of a word in the cloud does not relate at all to how frequently the word appears in the data set. After removing the stop words from the tidy_books data set of all the Jane Austen novels, the word “mother” is the largest word in this cloud and is in bold coloring. This means that it is the most commonly occurring word in the text. The word “hope” is in a slightly lighter font and slightly smaller, leading to the conclusion that it is the second most common word in the data set. Note that a maximum of 100 words are represented in this image. Below we see how this cloud can be made into an even more visually appealing picture using the wordcloud2() function.
Discuss the word cloud shown below thoroughly.
# Word counts for the second cloud, stored under a name that does not clash
# with the wordcloud2() function (the wordcloud2 package is assumed loaded).
cloud_words <- tidy_books %>%
  inner_join(get_sentiments("nrc")) %>%
  anti_join(stop_words) %>%
  count(word)
## Joining, by = "word"
## Joining, by = "word"
wordcloud2(data = cloud_words, size = 1,
           shape = "star",
           rotateRatio = 0.5,
           ellipticity = 0.9, color = "random-light", backgroundColor = "black")
The code above produces another type of word cloud that generates images that are more pleasing to the eye. We can change the shape of the arrangement and the color of the words, as well as the background color. Just as in the word cloud from the previous code, the words that are larger and have a bolder font color are more common than the words that are smaller and less bold. There is no maximum word count in this image. The placement of the words is also random; however, it appears that the more common words are placed toward the border while the least common words fade toward the center.