The average for Group 1 is 34.6 and the average for Group 2 is 40.6.
Some students, although they completed the DataCamp assignments, were completely clueless during the exam. It’s obvious that some of you completely ignored the instructor’s warning about writing code from scratch. Also, it looks like there is some cheating going on with the DataCamp assignments. For those who are cheating, I wonder about your boundaries: if you cheat on such a small thing, how big a thing would you cheat on?
Due to the DataCamp case, the contribution of each item to the grade is planned to be as follows (subject to change):

| Item | Total contribution to 100 points | 
|---|---|
| Midterm | 30 | 
| Final | 40 | 
| Quiz | 15 | 
| DataCamp assignments | 5 | 
| Question Pool | 5 | 
| Attendance | 5 | 
| Project (Bonus) | 7 | 

The contents are taken from the book Text Mining with R, which can be accessed online here. The R code of the book is available at this GitHub repo.
Before we start, please make sure the following libraries are installed and load them:
library(tidytext)
library(stringr)
library(tidyverse)

Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham, tidy data has a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
We thus define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
As we stated above, we define the tidy text format as being a table with one-token-per-row. Structuring text data in this way means that it conforms to tidy data principles and can be manipulated with a set of consistent tools. This is worth contrasting with the ways text is often stored in text mining approaches.
unnest_tokens function

Here’s a sample text.
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")
text
[1] "Because I could not stop for Death -"   "He kindly stopped for me -"             "The Carriage held but just Ourselves -"
[4] "and Immortality"

In order to turn it into a tidy text dataset, we first need to put it into a data frame.
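
The code that builds text_df and calls unnest_tokens does not appear above. A minimal sketch of that step, assuming text_df is a tibble with a line-number column (called `line` here, which is an assumed name) next to the `text` column:

```r
library(dplyr)
library(tidytext)

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

# One row per line of the poem; `line` is an assumed column name
text_df <- tibble(line = 1:4, text = text)

# Split the `text` column into one word per row, stored in a new column `word`
tidy_text <- text_df %>%
  unnest_tokens(word, text)

tidy_text
```

The `line` column is carried along, so every word still records which line it came from.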
The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.
After using unnest_tokens, we’ve split each row so that there is one token (word) in each row of the new data frame; the default tokenization in unnest_tokens() is for single words, as shown here. Also notice:
unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior.)

Having the text data in this format lets us manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2.
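
To see the lowercase conversion mentioned above, here is a small sketch on a single line of the poem. The object names `poem`, `lowered`, and `preserved` are made up for this illustration.

```r
library(dplyr)
library(tidytext)

poem <- tibble(line = 1, text = "The Carriage held but just Ourselves")

# Default behaviour: tokens are converted to lowercase
lowered <- poem %>%
  unnest_tokens(word, text)

# to_lower = FALSE keeps the original capitalisation
preserved <- poem %>%
  unnest_tokens(word, text, to_lower = FALSE)

lowered$word
preserved$word
```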
Let’s use the text of Jane Austen’s 6 completed, published novels from the janeaustenr package, and transform them into a tidy format. The janeaustenr package provides these texts in a one-row-per-line format, where a line in this context is analogous to a literal printed line in a physical book. Let’s start with that, and also use mutate() to add a linenumber column that keeps track of lines in the original format and a chapter column (found with a regex) that records where each chapter starts.
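
The chunk that creates original_books is not shown here. A sketch of how it could be built, following the description above; the exact regular expression is an assumption that matches headings such as "CHAPTER 1" or "Chapter IV":

```r
library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(
    # Running line number within each book
    linenumber = row_number(),
    # Each chapter heading bumps the running chapter counter by one
    chapter = cumsum(str_detect(text,
                                regex("^chapter [\\divxlc]",
                                      ignore_case = TRUE)))
  ) %>%
  ungroup()

original_books
```

Lines that come before the first chapter heading (title, author, year) get chapter 0.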
Here are the names of the books and the number of lines they contain:
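
One way to produce such a table is to count rows per book. This is only a sketch, and the names `lines_per_book` and `lines` are illustrative:

```r
library(janeaustenr)
library(dplyr)

# Each row of austen_books() is one printed line, so counting rows counts lines
lines_per_book <- austen_books() %>%
  count(book, name = "lines")

lines_per_book
```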
In order to orient ourselves, here are the first 30 lines from the book Sense and Sensibility:
SENSE AND SENSIBILITY
by Jane Austen
(1811)
CHAPTER 1
The family of Dashwood had long been settled in Sussex.  Their estate
was large, and their residence was at Norland Park, in the centre of
their property, where, for many generations, they had lived in so
respectable a manner as to engage the general good opinion of their
surrounding acquaintance.  The late owner of this estate was a single
man, who lived to a very advanced age, and who for many years of his
life, had a constant companion and housekeeper in his sister.  But her
death, which happened ten years before his own, produced a great
alteration in his home; for to supply her loss, he invited and received
into his house the family of his nephew Mr. Henry Dashwood, the legal
inheritor of the Norland estate, and the person to whom he intended to
bequeath it.  In the society of his nephew and niece, and their
children, the old Gentleman's days were comfortably spent.  His
attachment to them all increased.  The constant attention of Mr. and
Mrs. Henry Dashwood to his wishes, which proceeded not merely from
interest, but from goodness of heart, gave him every degree of solid
comfort which his age could receive; and the cheerfulness of the
children added a relish to his existence.

And now, the tidy version of all book contents:
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books

This function uses the tokenizers package to separate each line of text in the original data frame into tokens. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
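
As a quick illustration of the non-default tokenizers, the sketch below splits one line into bigrams and into single characters using the `token` argument. The data frame `sample_df` and the output column names are made up for this example.

```r
library(dplyr)
library(tidytext)

sample_df <- tibble(text = "He kindly stopped for me")

# Pairs of consecutive words (n-grams with n = 2)
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# Single characters
chars <- sample_df %>%
  unnest_tokens(char, text, token = "characters")

bigrams
chars
```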
Let’s count the most frequent words:
tidy_books %>%
  count(word, sort = TRUE)

Oops, this list does not give any usable information. Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
data(stop_words)
tidy_books <- tidy_books %>%
  anti_join(stop_words)
Joining, by = "word"
tidy_books

Let’s check the word counts again.
tidy_books %>%
  count(word, sort = TRUE)

Since we have the data in tidy format, it is a breeze to pipe it to ggplot and produce nice plots.
library(ggplot2)
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
#  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Don’t forget to fix the plot above.
On the subject of ordering columns, please draw the plot below with the bars in order:
# tibble() replaces the deprecated data_frame()
tibble(letters = letters[1:4], counts = c(20, 30, 15, 5)) %>%
  # copy/paste the mutate code from above and modify to order bars
  ggplot(aes(letters, counts)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# please check the letters column classes
ordered <- tibble(letters = letters[1:4], counts = c(20, 30, 15, 5)) %>%
  mutate(letters = reorder(letters, counts))
unordered <- tibble(letters = letters[1:4], counts = c(20, 30, 15, 5))

Please refer to other examples of word frequency calculation from Project Gutenberg in the book (Chapter 1.4).
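
For the class check in the ordering exercise above, this base R sketch shows what reorder() does to a character vector. The variable names are illustrative.

```r
letters_vec <- c("a", "b", "c", "d")
counts <- c(20, 30, 15, 5)

# reorder() turns the vector into a factor whose levels are sorted by counts
ordered_letters <- reorder(letters_vec, counts)

class(letters_vec)       # "character"
class(ordered_letters)   # "factor"
levels(ordered_letters)  # "d" "c" "a" "b" (ascending counts)
```

ggplot2 draws character columns in alphabetical order but follows the level order of a factor, which is why the reorder() step changes the order of the bars.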
sentiments dataset

sentiments
sentiments %>% 
  group_by(word) %>% 
  mutate(n_sent=n()) %>% 
  arrange(-n_sent, word)

The three general-purpose lexicons are AFINN from Finn Årup Nielsen, bing from Bing Liu and collaborators, and nrc from Saif Mohammad and Peter Turney.

tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")

With data in a tidy format, sentiment analysis can be done as an inner join. This is another of the great successes of viewing text mining as a tidy data analysis task; much as removing stop words is an antijoin operation, performing sentiment analysis is an inner join operation.
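
A toy sketch of sentiment analysis as an inner join, using the bing lexicon. The three-word data frame `toy_words` is made up for this illustration.

```r
library(dplyr)
library(tidytext)

toy_words <- tibble(word = c("happy", "the", "sad"))

# Only words that also appear in the lexicon survive the inner join,
# and each surviving word picks up its sentiment label
toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word")

toy_sentiment
```

The word "the" is dropped because it carries no sentiment in the lexicon, much as anti_join() dropped the stop words earlier.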
Let’s look at the words with a joy score from the NRC lexicon. What are the most common joy words in Emma?
nrcjoy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrcjoy) %>%
  count(word, sort = TRUE)
Joining, by = "word"

Small sections of text may not have enough words in them to get a good estimate of sentiment, while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts, how long the lines were to start with, etc. We then use spread() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).
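
The count, spread, and subtract steps can be tried on a toy table first. This sketch uses pivot_wider(), the newer tidyr replacement for spread(), rather than spread() itself; the data and the names `toy`, `toy_net`, and `net` are made up.

```r
library(dplyr)
library(tidyr)

# Five sentiment-bearing words spread over two sections (index 0 and 1)
toy <- tibble(
  index = c(0, 0, 0, 1, 1),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

toy_net <- toy %>%
  count(index, sentiment) %>%
  # One column per sentiment; sections with no words of a kind get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

toy_net
```

Section 0 has two positive words and one negative word, so its net sentiment is 1; section 1 has only negative words, so its net sentiment is -2.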
The %/% operator does integer division (x %/% y is equivalent to floor(x/y)), so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in.
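
A short base R check of how integer division maps line numbers to 80-line sections. The line numbers are chosen for illustration.

```r
linenumber <- c(1, 79, 80, 81, 159, 160)

# Lines 1-79 fall in section 0, lines 80-159 in section 1, and so on
index <- linenumber %/% 80
index  # 0 0 1 1 1 2

# Same result as flooring the ordinary division
identical(index, floor(linenumber / 80))  # TRUE
```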
library(tidyr)
janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  mutate(index= linenumber %/% 80) %>% 
  count(book, index , sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
Joining, by = "word"

Now we can plot these sentiment scores across the plot trajectory of each novel. Notice that we are plotting against the index on the x-axis that keeps track of narrative time in sections of text.
library(ggplot2)
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
Joining, by = "word"
bing_word_counts

This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.
There’s a problem with the word miss. In Jane Austen’s books it is mostly the title “Miss”, used like “Mr.” and “Mrs.”, rather than a negative word. Please refer to Section 2.4 to see how a particular word can be removed by adding it to stop_words.
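
One way to add a word to the stop list is to bind an extra row onto stop_words. This is a sketch only: the lexicon label "custom" and the two-word data frame `toy_words` are assumptions for illustration.

```r
library(dplyr)
library(tidytext)

# stop_words has two columns, word and lexicon; add "miss" as a custom entry
custom_stop_words <- bind_rows(
  tibble(word = "miss", lexicon = "custom"),
  stop_words
)

# Toy check: "miss" is filtered out, "dashwood" is kept
toy_words <- tibble(word = c("miss", "dashwood"))

kept <- toy_words %>%
  anti_join(custom_stop_words, by = "word")

kept
```

Using custom_stop_words in place of stop_words in the earlier anti_join() would remove miss before the sentiment counts are computed.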
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
Selecting by n

We’ll be having a quiz on the course contents, covering both today’s lecture and the DataCamp course.