Understand the task of text classification and learn about its applications
Learn about a basic automated way of text classification
Practice the lexicon-based analysis for sentiment classification of COVID-19 Tweets
Text classification is the task of assigning one or more predefined categories to a text. The key word here is “predefined”. An example is sentiment classification of online reviews, where the text classifier automatically assigns each review to one of the predefined categories, such as positive review or negative review. So we already know the categories into which the texts will be classified. What text classification adds is an algorithm that learns how to assign texts to those categories. In doing so, the classifier identifies linguistic features associated with each category: for example, the words happy and awesome are associated with positive sentiment, whereas the words poor and shit are associated with negative sentiment.
Text clustering is the task of grouping texts into unknown categories; that is, the categories that the texts are grouped into are not known a priori. Given a set of texts, a clustering system identifies that certain texts are more similar to one another than to others and assigns them to the same cluster, but it does not label the cluster. The number of clusters that a collection of texts will be split into is often not known in advance either. That is why text clustering is called unsupervised machine learning. Text classification, on the other hand, is supervised machine learning: the machine is told (supervised) which categories the texts in a collection can be assigned to.
In the early days, the classification of texts was done manually by “domain experts” who were familiar with the topics of the texts being classified. For example, given our collection of tweets about COVID-19, we can carefully read all the tweets and manually assign each tweet one or more sentiment categories. As expected, this approach is highly accurate, particularly when the data set is relatively small and the team of annotators is also small, which limits inconsistency among annotators. But it has a critical limitation when the number of documents to be classified is very large. Imagine how long it would take to read 1,000,000 tweets to classify their sentiment.
The next step in the history of text classification was rule-based systems, which used queries consisting of combinations of words to determine the category of a text. For instance, if a tweet included the words safe and thank, there could be a rule saying that the tweet belongs to the texts expressing positive sentiment. The accuracy of such a system is also high, but it suffers from a scalability (applicability) issue because building and maintaining the rules is an expensive process.
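The rule described above can be sketched in a few lines of base R. This is only an illustration: the function name classify_by_rule, the two sample tweets, and the "unclassified" fallback label are assumptions made for the example, not part of any existing system.

```r
# Hypothetical rule: label a tweet "positive" when it contains both
# the word "safe" and a word starting with "thank" (case-insensitive).
classify_by_rule <- function(text) {
  has_safe  <- grepl("\\bsafe\\b", text, ignore.case = TRUE, perl = TRUE)
  has_thank <- grepl("\\bthank", text, ignore.case = TRUE, perl = TRUE)
  # Tweets that do not satisfy the rule are left unclassified.
  ifelse(has_safe & has_thank, "positive", "unclassified")
}

# Made-up tweets for illustration only.
tweets <- c(
  "Thank you nurses, stay safe everyone",
  "Stuck at home again"
)

classify_by_rule(tweets)
```

Only the first tweet satisfies the rule. Covering negative sentiment, other wordings, or new topics would each require another hand-written rule, which is where the maintenance cost comes from.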
After this, machine learning came into the picture, and supervised machine learning became an effective, widely used approach to text classification. Supervised machine learning offers numerous algorithms for automated classification, ranging from Naive Bayes to decision trees, random forests, and support vector machines (SVMs). These systems come at the cost of annotated (labeled) data, which are required to train the supervised algorithms. For example, there should be a set of tweets that are already classified as positive or negative. From these data, the machine learns patterns of linguistic features, which it then applies to unseen (unlabeled) tweets that need to be assigned either positive or negative sentiment.
Last week, we did some lexicon-based sentiment analysis. This approach assumes that the overall sentiment orientation of a text is the sum of the sentiment orientations of the emotional words it contains. So, to analyze the sentiment of a text, we added up the individual sentiment scores of the words in the text that match entries in a sentiment lexicon.
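As a minimal sketch of this additive assumption, the base R snippet below scores one sentence against a toy lexicon. The lexicon, its +1/-1 scores, and the function name score_text are made up for illustration; a real lexicon is far larger and may use different scores.

```r
# Toy lexicon with made-up scores: +1 for positive words, -1 for negative words.
toy_lexicon <- c(happy = 1, awesome = 1, safe = 1, poor = -1, sick = -1)

score_text <- function(text, lexicon) {
  # Lowercase the text and split on anything that is not a letter or apostrophe.
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # Keep only the words found in the lexicon, then add up their scores.
  matched <- words[words %in% names(lexicon)]
  sum(lexicon[matched])
}

score_text("So happy the clinic is safe, but the service was poor", toy_lexicon)
```

Three words match the toy lexicon (happy, safe, poor), so the text scores 1 + 1 - 1 = 1. Words that are not in the lexicon contribute nothing, and a text with no matches scores 0.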
Lexicon-based sentiment analysis begins with annotating the words in a text with an emotion type or an intensity score taken from a sentiment lexicon. Among the many sentiment lexicons that provide lists of positive and negative words for evaluating the opinion or emotion in text, we use 1) Bing, 2) NRC, and 3) NRC-EIL, which are available in the textdata package.
lexicon_bing() returns the Bing lexicon, one of the most popular general-purpose English sentiment lexicons, which categorizes 6,787 words in a binary fashion as either positive or negative.
lexicon_nrc() returns the NRC lexicon, which is also a general-purpose English sentiment lexicon. This lexicon labels 13,901 words with 10 possible categories of sentiments or emotions: “positive”, “negative”, “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, and “trust”.
lexicon_nrc_eil() returns the NRC Emotion Intensity Lexicon (NRC-EIL), which is a list of 5,814 English words and their associations with four basic emotions (anger, fear, sadness, and joy). For a given word and emotion X, the assigned score ranges from 0 to 1. A score of 1 means that the word conveys the highest amount of emotion X, and a score of 0 means that it conveys the lowest amount of emotion X.
With the tidytext and textdata packages, we can do lexicon-based sentiment analysis on our tweet data in a tidy format. That is, our tweet data are arranged so that each row holds a single word from a tweet.
A flowchart of a typical text analysis that uses tidytext for sentiment analysis by Julia Silge
unnest_tokens() and inner_join()
To perform lexicon-based sentiment analysis, we need to have our data in a tidy format. Using unnest_tokens(), we've already learned how to convert tweets in a CSV file into a tidy data format that has one word per row. Once the tweets are in a tidy format, we are ready for lexicon-based sentiment analysis with inner_join(), which keeps only the words that appear in both the tweets and the sentiment lexicon.
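The two steps can be sketched as follows. The tweets and the mini-lexicon are made up for illustration; the mini-lexicon is assumed to mirror the word and sentiment columns of the Bing lexicon, so that lexicon_bing() could be swapped in once it has been downloaded.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Two made-up tweets standing in for the COVID-19 tweet data.
tweets <- tibble(
  id   = 1:2,
  text = c("Thank you nurses, stay safe and happy",
           "Poor testing and a sick family, so sad")
)

# Hypothetical mini-lexicon; column names assumed to match the Bing lexicon.
toy_lexicon <- tibble(
  word      = c("thank", "safe", "happy", "poor", "sick", "sad"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

tweet_sentiment <- tweets %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  count(id, sentiment)                      # tally matches per tweet and category

tweet_sentiment
```

Because inner_join() drops every word that has no entry in the lexicon, the result depends heavily on lexicon coverage: a tweet with no matching words disappears from the output entirely.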