Welcome to the Machine Learning and Text Analysis Module. A few of my research projects involve analyzing text data that is floating out there on the internet. One such source is Tweets. Tweets are one source of data that we can examine to get a sense of individuals’ perceptions of and concerns about an array of topics. While Tweets may not be representative of the perceptions and beliefs of the general public, they are one source of views. In this module, we will examine a set of tweets that I have examined as a part of my research project on recovery from the Colorado Floods of 2013.
We collected all social media (Tweets) that were created and disseminated in the three-year period following the 2013 floods. Approximately 100,000 tweets containing the hashtag #coflood were collected over the three-year period. Retweets, non-English tweets, and tweets with no English-language content were removed, leaving approximately 20,000 tweets for analysis. The tweets were pre-processed: stop words were removed before analysis, and tweets that contained no words beyond ‘stop words’ were removed from the corpus.
In this module, you will step through a classification task on the authors of the tweets. Before we get going, please install the packages ‘quanteda’ and ‘quanteda.textmodels’. These are among the most popular packages for text analysis in R. Quanteda stands for the Quantitative Analysis of Textual Data. Please look over the following website about the structure of text data, the corpus, and the Document Feature Matrix (DFM): https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/
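If you have not installed these packages yet, something along these lines should work (you only need to run it once, so you may want to run it in the console rather than in your Rmd):
install.packages(c("quanteda", "quanteda.textmodels")) # one-time install of the text analysis packages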
When you have completed the module, please upload your Rmd and html to the Optional Modules link.
Please run the chunk below to load the packages.
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.0.3
## Package version: 2.1.2
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(quanteda.textmodels)
## Warning: package 'quanteda.textmodels' was built under R version 4.0.3
##
## Attaching package: 'quanteda.textmodels'
## The following object is masked from 'package:quanteda':
##
## data_dfm_lbgexample
library(RColorBrewer)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The first goal of this tutorial is to classify the tweets by author type. We want to know whether the public and experts tweet differently (i.e., produce different content). We will classify the author category of each tweet using a supervised machine learning approach in R.
In the classification, three author categories were used: elite, policy elite, and public.
Supervised learning means that we will train our model on hand-coded tweets (the training dataset) and then test our model on a testing dataset to see how well the model performs. I hand-coded approximately 10,000 of the tweet bios. This means that I read each bio and decided, based on the bio, whether the person was an elite, a policy elite, or a member of the general public. This took a bit of time, as you might imagine. Please look over the cleantweets.csv spreadsheet and check out the bios (variable: bio) and the classification (variable: code). You will see that many of the bios are repeats (e.g., one individual tweeted multiple times). We will keep the bio replicates in, recognizing that this may lead to some bias in our modeling (a failure to meet the independence-of-observations assumption).
Please read a bit more about supervised vs unsupervised machine learning here: https://www.guru99.com/supervised-vs-unsupervised-learning.html.
Question 1 Discuss the difference between supervised and unsupervised learning (a short paragraph will do). (2 points) Supervised learning is an approach in which the algorithm “learns” from data with explicit labels and answer keys, while unsupervised learning is an approach in which the algorithm is self-learning and extracts patterns and trends that aren’t explicitly stated.
First we will read the tweet data from a .csv file. We don’t want the strings as factors. Please run the chunk below to read in the data. In the chunk below, use R code to look at the first 100 bios. You may first need to glimpse the data so you are aware of the variable names and types.
Question 2 Look over the first 100 bios in the dataframe. In a paragraph discuss the bios. Do you notice any themes in the bios? Any challenges to analyzing the bios? (2 points) This is an unsupervised learning example! I have to make sense of the bios without categories or a key, just a collection of raw data. I notice hashtags and ‘@’ mentions; the bios are usually one sentence and usually under 20 words. Colorado is a keyword that came up frequently, as did the environment, news networks, and citizens.
tweets.df<-read.csv("cleantweets.csv", stringsAsFactors=FALSE)
glimpse(tweets.df)
## Rows: 16,734
## Columns: 18
## $ random <dbl> 14.620062, 17.674565, 17.712783, 18.998703, 19.724...
## $ twitterid <dbl> 3.79e+17, 3.79e+17, 3.80e+17, 3.82e+17, 3.79e+17, ...
## $ date <dbl> 41532.10, 41531.93, 41535.09, 41541.11, 41532.11, ...
## $ tweet <chr> "\"@KellySommariva: Holy moly! - look at this shot...
## $ retweetcount <int> 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 10...
## $ favoritecount <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1,...
## $ userid <dbl> 607597467, 607597467, 607597467, 607597467, 607597...
## $ code <chr> "elite", "elite", "elite", "elite", "elite", "elit...
## $ realname <chr> "CodeRed Fire&Safety", "CodeRed Fire&Safet...
## $ username <chr> "CodeCodeRed_1", "CodeCodeRed_1", "CodeCodeRed_1",...
## $ location <chr> "Colorado, USA", "Colorado, USA", "Colorado, USA",...
## $ bio <chr> "Fire/Flood & Safety Mitigation ~x~ A Division...
## $ URL <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""...
## $ Followers <int> 248, 248, 248, 248, 248, 248, 248, 248, 248, 248, ...
## $ Following <int> 1077, 1077, 1077, 1077, 1077, 1077, 1077, 1077, 10...
## $ No..of.Tweets <int> 1206, 1206, 1206, 1206, 1206, 1206, 1206, 1206, 12...
## $ Verified.Account <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N...
## $ X <chr> "", "", "", "", "", "", "", "", "", "", "", "", ""...
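One way to look over the first 100 bios (using the bio variable shown in the glimpse output above) is:
head(tweets.df$bio, 100) # print the first 100 author bios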
As you saw in the data, the tweet bios are quite messy, with a lot of characters that don’t add much meaning and may muddle the analysis. We want to remove all of those funky characters. We can use the function gsub() in base R to clean up the data. You could also wrap the cleaning in a function of your own, named something like removeSpecialChars (that’s a name that I made up; you could call it removeTheCrap or whatever). The gsub() function has the arguments:
gsub(pattern, replacement, x)   # i.e., gsub(old, new, string)
So let’s practice first, but not on the tweet data. Below we make a vector with two items, “I hate statistics” and “I love the outdoors.”
stats<-c("I hate statistics", "I love the outdoors.")
stats
## [1] "I hate statistics" "I love the outdoors."
In the next chunk we will use gsub() to replace “hate” with “love”.
stats.new<-gsub("hate", "love", stats)
stats.new
## [1] "I love statistics" "I love the outdoors."
Question 3 In the chunk below, develop a vector with two sentences as I did above with stats. Use the gsub() function to replace one or more of the words in the sentences. (2 points)
computer<-c("Computers make me blind", "The sun makes me smart.")
computer
## [1] "Computers make me blind" "The sun makes me smart."
computer.new<-gsub("blind", "smart", computer)
computer.news<-gsub("smart.", "blind", computer.new)
computer.news
## [1] "Computers make me smart" "The sun makes me blind"
In the chunk below, I will use gsub to remove characters that are non-alphanumeric.
tweets.df$bio.clean<-gsub("[^a-zA-Z0-9 ]", " ", tweets.df$bio)
head(tweets.df, 100)
Question 4 Look over the first few bios and compare bio and bio.clean. Discuss some of the characters that were removed. (2 points) The “…” characters were removed in bio.clean; the line returns were removed, leaving stray “n” characters behind.
In the next step, we will randomly shuffle the data frame so we can later divide it into training and testing data. We will use the sample() function to do this. We are randomly sampling row indices with a sample size equal to the number of rows in tweets.df.
set.seed(1234) # sets a seed number so we can replicate the random shuffle
tweets.df<-tweets.df[sample(nrow(tweets.df)),]
The next step is to structure the cleaned bios (tweets.df$bio.clean) into a corpus. We can do this with the corpus() function in the Quanteda package. Please read about a text corpus here: https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda/. When you run the chunk below, you will see what the corpus includes.
Question 5 Describe the contents of the corpus after running the chunk below. How many documents (each tweet is a document) are in the corpus? Please describe in a sentence or two. (2 points) There are 16,734 documents/tweets in the corpus.
bio.clean.corpus<-corpus(tweets.df$bio.clean)
summary(bio.clean.corpus)
We will use a Naïve Bayes machine learning approach in R (from the quanteda.textmodels package) to classify the cleaned bios. To understand how the multinomial Naive Bayes process works on text, please watch this short video here: https://www.youtube.com/watch?v=O2L2Uv9pdDA
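Before (or after) watching the video, here is a tiny made-up example of the intuition. The classes, words, and probabilities below are invented purely for illustration (they are not from the tweet data): a multinomial Naive Bayes classifier multiplies the prior probability of each class by the probability of each observed word under that class, once per occurrence, and then normalizes.
# Toy multinomial Naive Bayes by hand (all numbers are made up for illustration)
prior <- c(spam = 0.5, ham = 0.5) # prior belief about each class
p.word <- rbind(spam = c(offer = 0.6, meeting = 0.1, flood = 0.3),
                ham  = c(offer = 0.1, meeting = 0.5, flood = 0.4)) # P(word | class)
doc <- c(offer = 2, meeting = 0, flood = 1) # word counts in a new document
# posterior is proportional to prior * product over words of P(word | class)^count
unnorm <- prior * apply(p.word, 1, function(p) prod(p^doc))
unnorm / sum(unnorm) # normalized class probabilities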
Question 6 After watching the video on Naive Bayes Machine Learning Classification, please write a few sentences describing the intuition of the process to classify text into categories (the video talks about classifying emails as spam). (2 points)
To develop our Naive Bayes model, we will tell R the class labels. These are the hand-coded labels I assigned to each of the three classes (public, elite, and policy elite). Please look above in the introduction to the module for definitions of the classes.
Using the docvars() function, we will add the “code” from the tweets.df dataframe. The function docvars() sets a document-level variable. In our case, each tweet is considered a document. We are setting the code for each document. When you run the chunk below, you should see that code is now included in the corpus.
docvars(bio.clean.corpus, "code")<-tweets.df$code
summary(bio.clean.corpus)
Great! We now have a corpus of tweets. We need to make a training dataset with which to train our model on how to classify the bios (based on the hand-coded tweets) and a testing dataset to see how well our model performed. Below we will set the training dataset to the first 1000 tweets (we randomly shuffled them earlier), and then we will test on the rest. We will make two new dataframes: tweets.train.df and tweets.test.df.
tweets.train.df<-tweets.df[1:1000,]
tweets.test.df<-tweets.df[1001:nrow(tweets.df),]
Now we will make a document-feature matrix (DFM). A DFM is a matrix with rows of documents (in this case tweets/bios) and columns of features (in this case words in the bios). Each cell contains the number of times a word appears in a given bio. We will use the function dfm(). Check out this site to learn more about a document-feature matrix: https://data.library.virginia.edu/a-beginners-guide-to-text-analysis-with-quanteda. We will also make sure that all of the words are in lower case so capitalization doesn’t influence the analysis. We will make the dfm from the corpus.
bio.clean.dfm <- dfm(bio.clean.corpus, tolower = TRUE)
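As a quick aside, here is what a small DFM looks like on two made-up sentences (not part of the module data): one row per document, one column per word, with each cell holding a count.
toy.corpus <- corpus(c("Boulder flood recovery", "Flood flood response in Boulder"))
dfm(toy.corpus, tolower = TRUE) # tiny document-feature matrix: 2 documents by their words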
This next chunk trims the matrix based on minimal criteria. In essence, we only want to include words that appear at least five times across all the tweets and in at least three documents (tweets).
bio.clean.dfm <- dfm_trim(bio.clean.dfm, min_termfreq = 5, min_docfreq = 3)
Next we need to split the dfm into training and testing sets. We want this split to match how we split the dataframes above.
bio.clean.dfm.train<-bio.clean.dfm[1:1000,]
bio.clean.dfm.test<-bio.clean.dfm[1001:nrow(tweets.df),]
We will now develop our Naive Bayes classifier using the function textmodel_nb(). Naive Bayes uses Bayes’ theorem to calculate the probability that a bio falls into each class (public, elite, policy elite), given the values of X (the document-feature matrix). The second argument is the vector of training labels. To read more about the function textmodel_nb, check out: https://tutorials.quanteda.io/machine-learning/nb/. and/or https://www.rdocumentation.org/packages/quanteda/versions/1.5.2/topics/textmodel_nb
We are training the model with the dfm and the code from tweets.train.df. We will use uniform priors (a prior belief of 0.333 for the probability of each classification) and a multinomial distribution.
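For reference, the shorter call in the next chunk relies on the function’s defaults; writing them out explicitly (the argument names smooth, prior, and distribution come from the textmodel_nb documentation) would look something like this, and should produce the same model:
nb.classifier.explicit <- textmodel_nb(bio.clean.dfm.train, tweets.train.df$code,
                                       smooth = 1,                  # Laplace smoothing
                                       prior = "uniform",           # prior of 1/3 for each class
                                       distribution = "multinomial")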
nb.classifier<-textmodel_nb(bio.clean.dfm.train, tweets.train.df$code)
nb.classifier
##
## Call:
## textmodel_nb.dfm(x = bio.clean.dfm.train, y = tweets.train.df$code)
##
## Distribution: multinomial ; priors: 0.3333333 0.3333333 0.3333333 ; smoothing value: 1 ; 1000 training documents; fitted features.
Now, based on our model, we need to make predictions about the classifications on the test dfm. We will call the predictions pred and will place them in the test dataframe named tweets.test.df.
tweets.test.df$pred<-predict(nb.classifier, bio.clean.dfm.test)
Awesome! Now we need to see how it performed. Fingers crossed!
Now we can make a confusion matrix (think back to the logistic regression video). The counts on the diagonal were properly classified. The counts not on the diagonal were not properly classified. Here is a guide to confusion matrices: https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/.
confusion.matrix<-table(predicted=tweets.test.df$pred, actual=tweets.test.df$code)
confusion.matrix
## actual
## predicted elite policy elite public
## elite 6253 150 889
## policy elite 142 3321 139
## public 605 45 4190
We can now calculate sensitivity and specificity. TP = True Positive; FN = False Negative (a bio that should have been coded elite but was not).
The sensitivity of each class can be calculated as TP/(TP+FN).
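One way to do this (a sketch that treats ‘elite’ as the positive class and pulls the counts straight from confusion.matrix, where rows are predicted and columns are actual):
tp.elite <- confusion.matrix["elite", "elite"]           # predicted elite AND actually elite
fn.elite <- sum(confusion.matrix[, "elite"]) - tp.elite  # actually elite but predicted otherwise
sensitivity.elite <- tp.elite / (tp.elite + fn.elite)
sensitivity.elite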
Question 7 Based on the definition of sensitivity and the output above, calculate the sensitivity for the elite classification in the chunk below.(2 points)
The specificity of each class can be calculated as TN/(TN+FP). The TN (True Negative) count for elite is all non-elite occurrences that were not classified as elite. The FP (False Positive) count for elite includes all bios that were predicted to be elite but were not elite.
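A matching sketch for specificity, again treating ‘elite’ as the positive class:
tn.elite <- sum(confusion.matrix[rownames(confusion.matrix) != "elite",
                                 colnames(confusion.matrix) != "elite"])  # non-elite bios not called elite
fp.elite <- sum(confusion.matrix["elite", colnames(confusion.matrix) != "elite"]) # called elite but not elite
specificity.elite <- tn.elite / (tn.elite + fp.elite)
specificity.elite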
Question 8 Calculate the specificity based on the output above. (2 points)
And we can calculate the overall accuracy of the predictions. I did this by taking the mean (i.e., the proportion) of occurrences where the predicted values (pred) equal the code (hand-coded values).
accuracy<-mean(tweets.test.df$pred==tweets.test.df$code)*100
accuracy
## [1] 87.47934
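Equivalently, since the diagonal of the confusion matrix holds the correctly classified counts, the same accuracy can be read off the confusion matrix:
sum(diag(confusion.matrix)) / sum(confusion.matrix) * 100 # correct classifications / all classifications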
Question 9 Rerun the steps above, but instead of training the model on 1000 tweets, train the model on 5,000 tweets. You will need to change this number in a couple of different places. In the chunk below, develop the new confusion matrix. Also in the chunk below, calculate the specificity and sensitivity of the elite classification and the overall accuracy. Did the accuracy improve? (4 points)
tweets2.train.df<-tweets.df[1:5000,]
tweets2.test.df<-tweets.df[5001:nrow(tweets.df),]
bio2.clean.dfm <- dfm(bio.clean.corpus, tolower = TRUE)
bio2.clean.dfm <- dfm_trim(bio2.clean.dfm, min_termfreq = 5, min_docfreq = 3)
bio2.clean.dfm.train<-bio2.clean.dfm[1:5000,]
bio2.clean.dfm.test<-bio2.clean.dfm[5001:nrow(tweets.df),]
nb2.classifier<-textmodel_nb(bio2.clean.dfm.train, tweets2.train.df$code)
nb2.classifier
##
## Call:
## textmodel_nb.dfm(x = bio2.clean.dfm.train, y = tweets2.train.df$code)
##
## Distribution: multinomial ; priors: 0.3333333 0.3333333 0.3333333 ; smoothing value: 1 ; 5000 training documents; fitted features.
tweets2.test.df$pred<-predict(nb2.classifier, bio2.clean.dfm.test)
confusion2.matrix<-table(predicted=tweets2.test.df$pred, actual=tweets2.test.df$code)
confusion2.matrix
## actual
## predicted elite policy elite public
## elite 4762 60 318
## policy elite 68 2532 35
## public 424 50 3485
accuracy2<-mean(tweets2.test.df$pred==tweets2.test.df$code)*100
accuracy2
## [1] 91.86126
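If you also want the elite sensitivity and specificity for this larger training set, one way to get them is to mirror the sketches above using confusion2.matrix:
tp2 <- confusion2.matrix["elite", "elite"]
fn2 <- sum(confusion2.matrix[, "elite"]) - tp2
tn2 <- sum(confusion2.matrix[rownames(confusion2.matrix) != "elite",
                             colnames(confusion2.matrix) != "elite"])
fp2 <- sum(confusion2.matrix["elite", colnames(confusion2.matrix) != "elite"])
c(sensitivity = tp2 / (tp2 + fn2), specificity = tn2 / (tn2 + fp2))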
Now that we have developed a decent model to classify the author of tweets, we can use it to make predictions on a large number of tweets without having to hand code them! This is awesome because it saves a lot of time and is probably more consistent than hand coding the bios. The next step in the analysis (which we won’t do here) is to determine if the public and the elites or policy makers tweet about the flood in similar or different ways. We could use a number of different approaches for this (a dictionary bag-of-words approach, sentiment analysis, etc.). We won’t get into that, but that’s the next step (if you’re interested, let me know and I can try to make a follow-up module!)
Thank you for following along with the machine learning text classification module. I hope you enjoyed it.