Introduction

This is example code from Simon’s AI workshop, which explores NLP and its research uses. See Simon’s GitHub page for more information.

From Simon’s document:

In this lab, we will introduce tools for natural language processing (NLP), from basic data preparation through to some exploration and building a simple machine learning model. We are only scratching the surface of what is possible with NLP methods in this lab. See the tidytext website for further examples.

You’ll need several packages for the lab including:

tidytext: a library for cleaning and processing text data
SnowballC
spacyr
textstem
word2vec
uwot
textdata

# Set working directory

setwd("../r_code/")

# install packages

#install.packages(c("tidytext", "SnowballC", "spacyr", "textstem", "word2vec", "uwot", "textdata"))

# library load

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

library(tidytext)

## Warning: package 'tidytext' was built under R version 4.2.3

library(textstem)

Data

This example uses climate change tweets from 2015-2018 located here.

The data are held in the file twitter_sentiment_data.csv, which you can download from the GitHub repository. Read these in and take a quick look. There are three columns: a sentiment estimate, the tweet (message) and a tweet id. The sentiment estimates were provided by a group of experts and are tagged as follows:

2 (News): the tweet links to factual news about climate change
1 (Pro): the tweet supports the belief of man-made climate change
0 (Neutral): the tweet neither supports nor refutes the belief of man-made climate change
-1 (Anti): the tweet does not believe in man-made climate change

# read in the data

dat <- read.csv("../data/twitter/twitter_sentiment_data.csv")

str(dat)

## 'data.frame':    43943 obs. of  3 variables:
##  $ sentiment: int  -1 1 1 1 2 0 2 2 0 1 ...
##  $ message  : chr  "@tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for"| __truncated__ "RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change"| __truncated__ "Fabulous! Leonardo #DiCaprio's film on #climate change is brilliant!!! Do watch. https://t.co/7rV6BrmxjW via @youtube" "RT @Mick_Fanning: Just watched this amazing documentary by leonardodicaprio on climate change. We all think thi"| __truncated__ ...
##  $ tweetid  : num  7.93e+17 7.93e+17 7.93e+17 7.93e+17 7.93e+17 ...
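The numeric sentiment codes are easier to read once they carry their labels. As a minimal sketch, the block below takes the first ten codes shown in the str() output above and converts them to a labelled factor. The names codes and sentiment_label are my own and are not part of the lab code.

```r
# First ten sentiment codes as printed by str(dat) above
codes <- c(-1, 1, 1, 1, 2, 0, 2, 2, 0, 1)

# Attach the labels from the tagging scheme described in the text
sentiment_label <- factor(codes,
                          levels = c(-1, 0, 1, 2),
                          labels = c("Anti", "Neutral", "Pro", "News"))

# Count how many tweets fall under each label
table(sentiment_label)
```

The same factor() call could be applied to dat$sentiment to tabulate the full dataset.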

Our basic plan here is:

Prepare the data for analysis.
Visualize and explore the data
Create an embedding for the tweets. This represents each tweet as a vector of numbers, and can be used for further analysis
Create a simple machine learning model to predict the sentiment of a tweet

Text Processing

General Cleaning

Processing text data into a usable form can be one of the most time consuming parts of the analysis. Basically, we want to remove any characters or words that are irrelevant to any analysis. In addition, we should try to simplify and standardize the language used. For example, a computer will not necessarily recognize that ‘see’ and ‘seen’ are related to each other.

First, we’ll remove any retweets from the dataset (indicated by RT at the start of the message). While there are some applications where the number of retweets are of interest, we will consider them as duplicates for this exercise.

# remove retweets

dat = dat |>
  filter(str_starts(message, "RT", negate = TRUE))

# culls the tweets from 43943 to 18866
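To see what this filter keeps and drops, here is a small base R sketch of the same idea on three made-up messages (they are hypothetical, not taken from the dataset). It uses startsWith() rather than str_starts(), but the logic is the same: keep a message only if it does not begin with RT.

```r
# Hypothetical messages, made up for illustration
msgs <- c("RT @someone: climate change is real",
          "Climate change is real",
          "ART exhibit on climate change")

# TRUE for messages that do NOT start with "RT"
keep <- !startsWith(msgs, "RT")

msgs[keep]
```

Only the first message is dropped. The third contains RT but does not start with it, so it is kept.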

To illustrate the next steps, we’ll extract the fourth tweet from the dataset:

# extract the 4th tweet

tweet = dat[4, ]

# print the message
print(tweet$message)

## [1] "#BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change... https://t.co/HCIZrPUhLF"

This is a typical tweet and has several issues for text processing:

There is a URL at the end of the tweet
There is at least one username (@…)
There are several hashtags (#…)

We’ll use several steps to clean this up. To illustrate these, we’ll walk through the individual steps for the first 5 tweets.

First extract the first 5 tweets:

#n = the number of rows

tidy_dat <- dat %>%
  slice_head(n = 5)

Remove various non-words (URLs, symbols, etc)

# str_replace reference example code 
# create an example data frame
fruits <- c("one apple", "two pears", "three bananas")

# replace the first vowel encountered with a "-"
str_replace(fruits, "[aeiou]", "-")

## [1] "-ne apple"     "tw- pears"     "thr-e bananas"

# replace all vowels with a dash
str_replace_all(fruits, "[aeiou]", "-")

## [1] "-n- -ppl-"     "tw- p--rs"     "thr-- b-n-n-s"

#make all of the vowels upper case
str_replace_all(fruits, "[aeiou]", toupper)

## [1] "OnE ApplE"     "twO pEArs"     "thrEE bAnAnAs"

# make anything with a "b" to NA
str_replace_all(fruits, "b", NA_character_)

## [1] "one apple" "two pears" NA

# [A-Za-z\\d] matches a single ASCII letter or digit; the trailing + means one or more of them

tidy_dat <- tidy_dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) 
head(tidy_dat)

Remove usernames (starting with @..)

tidy_dat <- tidy_dat %>% 
  mutate(message = str_replace_all(message, "@\\w+", "")) 
head(tidy_dat)
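The two replacement steps can be checked on the example tweet printed earlier. This is a sketch in base R using gsub() with perl = TRUE (so that \\d is understood inside the square brackets) rather than str_replace_all(). The variable names are my own. The last two lines illustrate one thing to watch for: the bare RT alternative also matches inside longer words, and a word-boundary version is one possible way around that, not something the lab code does.

```r
# The example tweet printed earlier in this section
tweet_text <- "#BeforeTheFlood Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change... https://t.co/HCIZrPUhLF"

# Same patterns as the lab code
url_pattern <- "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https"
no_url  <- gsub(url_pattern, "", tweet_text, perl = TRUE)
no_user <- gsub("@\\w+", "", no_url, perl = TRUE)
no_user

# The bare RT alternative also removes RT inside a longer word
gsub("RT", "", "SMART grid")

# A word-boundary version only removes RT when it stands alone
gsub("\\bRT\\b", "", "SMART grid RT", perl = TRUE)
```

The hashtags are still present after these two steps.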

Convert the tweets into individual words or tokens. Note that this converts the data from being one line per tweet to one line per word.

tidy_dat <- tidy_dat %>%
  unnest_tokens(word, message)
head(tidy_dat)
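The change of shape from one row per tweet to one row per word can be sketched in base R. This is only a rough approximation of what unnest_tokens() does (lower-casing and splitting on non-word characters), and the two-tweet data frame is hypothetical.

```r
# Hypothetical two-tweet data frame
tweets <- data.frame(id = c(1, 2),
                     message = c("Climate change is REAL!", "Watch this film"),
                     stringsAsFactors = FALSE)

# Lower-case, then split on anything that is not a letter, digit or apostrophe
tokens <- strsplit(tolower(tweets$message), "[^a-z0-9']+")

# One row per word, repeating the tweet id for each of its words
long <- data.frame(id = rep(tweets$id, lengths(tokens)),
                   word = unlist(tokens),
                   stringsAsFactors = FALSE)
long
```

The id column is what lets us regroup the words by tweet later on.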

Finally, remove stopwords. These are a predefined set of commonly occurring words that have little value in analysis (e.g. the, and, …).

# remove stopwords

tidy_dat <- tidy_dat %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
head(tidy_dat)
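Both conditions in that filter can be seen on a small vector. In this sketch the three-word stop list is a hypothetical stand-in for tidytext’s much longer stop_words table. The second condition keeps only tokens containing at least one letter, which drops purely numeric tokens.

```r
words <- c("the", "climate", "is", "changing", "2015", "and", "fast")

# Hypothetical mini stop list; tidytext's stop_words is far longer
stops <- c("the", "is", "and")

# Keep words that are not stopwords AND contain at least one lower-case letter
kept <- words[!words %in% stops & grepl("[a-z]", words)]
kept
```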

Word Matching

The last thing we’ll need to do is match words with similar meanings. There are a couple of approaches to this: stemming and lemmatization. Stemming strips words back to the core stem using stem_words() from the textstem library. For example, here are 5 different words related to programming. The stemmer converts them all to program:

# example

# create a vector
words <- c("program","programming","programer","programs","programmed")

# stem them to convert them all to program
stem_words(words)

## [1] "program" "program" "program" "program" "program"

One disadvantage to this is that the stems may no longer reflect actual words. For example, the stem to climate is climat:

stem_words("climate")

## [1] "climat"

Lemmatization attempts to avoid these issues by converting words to a standard form, and accounting for the meaning of the surrounding words. Here we’ll use the spacyr package to perform lemmatization. Use this to compare the conversion of saw in these two phrases:

library(spacyr)

#need to run spacy install first!
#spacy_install()

spacy_parse("Owen saw a rabbit")

## successfully initialized (spaCy Version: 3.7.5, language model: en_core_web_sm)

# second sentence

spacy_parse("Owen cut a plank with a saw")

Integrating Cleaning and Word Matching

Step 1: Clean

## Combine all of the cleaning steps using dplyr and the text string packages 

tidy_dat <- dat %>%
  mutate(message = str_replace_all(message, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;|&lt;|&gt;|RT|https", "")) %>% 
  mutate(message = str_replace_all(message, "@\\w+", "")) %>%
  unnest_tokens(word, message) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

Step 2: Lemmatize

Note: To keep things running quickly in this lab, we’ll use textstem’s function for lemmatization. This is not quite as robust as the spacyr library, but substantially faster.

tidy_dat$clean_word <- lemmatize_words(tidy_dat$word)
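As far as I understand it, lemmatize_words() works by looking each word up in a lemma dictionary, which is why it is fast. The sketch below illustrates that idea with a made-up three-entry dictionary. The function lemmatize_lookup() and the dictionary are hypothetical and are not textstem’s actual implementation.

```r
# Hypothetical three-entry lemma dictionary (token -> lemma), made up for illustration
lemma_dict <- c(saw = "see", seen = "see", warming = "warm")

# Replace each word by its lemma if it is in the dictionary; otherwise leave it alone
lemmatize_lookup <- function(words, dict) {
  hit <- words %in% names(dict)
  out <- words
  out[hit] <- unname(dict[words[hit]])
  out
}

lemmatize_lookup(c("seen", "warming", "climate", "saw"), lemma_dict)
```

A plain lookup like this always maps saw to the same lemma, whether it is the verb or the tool. Distinguishing the two needs the surrounding words, which is what a parser such as spacyr uses.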

Exploring the Data

We can now use the cleaned text data to do some exploration. We’ll start by making some word clouds. These are a very common visualization of text data, where words are randomly placed on a figure and scaled according to their frequency. We’ll use the wordcloud package to make plots, and create a data frame of the counts of individual words for use in the cloud.

tidy_count <- tidy_dat %>%
  count(clean_word) %>%
  arrange(-n) #in descending order


head(tidy_count)
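For comparison, the same count-and-sort step can be sketched in base R with table() and sort(). The word vector here is hypothetical.

```r
# Hypothetical cleaned words
clean_word <- c("climate", "change", "climate", "warm", "change", "climate")

# Count each word, then sort from most to least frequent
counts <- sort(table(clean_word), decreasing = TRUE)
counts
```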

First, let’s plot all the data. This is, not surprisingly, dominated by the words ‘climate’ and ‘change’.

library(wordcloud)

# plot the clean words, using their frequency (tidy_count$n), and plot no more than 100 words - max.words
wordcloud(tidy_count$clean_word, tidy_count$n, max.words = 100)

For the next plot, we’ll extract only the ‘pro’ tweets, and skip plotting climate, change, global and warm by filtering them out in the same way as stopwords:

tidy_count_pos <- tidy_dat %>%
  filter(sentiment == 1,
         !clean_word %in% c("climate", "change", "global", "warm")) %>%
  count(clean_word) %>%
  arrange(-n)

# plot
wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)

Negative tweets

tidy_count_neg <- tidy_dat %>%
  filter(sentiment == -1,
         !clean_word %in% c("climate", "change", "global", "warm")) %>%
  count(clean_word) %>%
  arrange(-n)

wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words = 100)

## Warning in wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words =
## 100): people could not be fit on page. It will not be plotted.

## Warning in wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words =
## 100): cause could not be fit on page. It will not be plotted.

## Warning in wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words =
## 100): threat could not be fit on page. It will not be plotted.

## Warning in wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words =
## 100): proof could not be fit on page. It will not be plotted.

## Warning in wordcloud(tidy_count_neg$clean_word, tidy_count_neg$n, max.words =
## 100): real could not be fit on page. It will not be plotted.

Sentiment Analysis

Next, we’ll estimate the sentiment of the tweets. The data already has a column labeled sentiment, which is a category describing whether the tweet was for or against climate change (or neutral). Sentiment analysis is a little different from this, as it attempts to score some text based on whether the words are overall positive, neutral or negative, irrespective of the belief in or against climate change. There are several different lexicons for sentiment analysis, some of which provide more fine-grained detail. Here we’ll use a function from the tidytext library (get_sentiments()) with the AFINN lexicon, which scores each word between -5 (negative) and 5 (positive). You may be prompted to download the AFINN lexicon when running this.

get_sentiments("afinn") %>%
  head()

We merge this with the cleaned data by joining on the cleaned word:

# new df = inner_join our tidy dataset, get_sentiments by clean_word (tidy data) and the word (get_sentiments)
tidy_sentiment <- inner_join(tidy_dat, get_sentiments("afinn"), by = c("clean_word" = "word"))

head(tidy_sentiment)
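The effect of the inner join can be seen on a toy example. In the base R sketch below, both data frames are hypothetical and the scores are made up (they are not the real AFINN values). merge() plays the role of inner_join(): any word without a lexicon entry is dropped.

```r
# Hypothetical cleaned words, one row per word
tidy_toy <- data.frame(tweetid = c(1, 1, 1, 2, 2),
                       clean_word = c("love", "planet", "threat", "hoax", "climate"),
                       stringsAsFactors = FALSE)

# Hypothetical lexicon: these scores are made up and are NOT the real AFINN values
toy_lexicon <- data.frame(word = c("love", "threat", "hoax"),
                          value = c(3, -2, -2),
                          stringsAsFactors = FALSE)

# Inner join: words missing from the lexicon (planet, climate) are dropped
scored <- merge(tidy_toy, toy_lexicon, by.x = "clean_word", by.y = "word")

# Mean score per tweet
per_tweet <- aggregate(value ~ tweetid, data = scored, FUN = mean)
per_tweet
```

Because only scored words survive the join, a tweet with no lexicon words disappears from the result entirely.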

And we can make a word cloud of the positive terms used in conjunction with climate change (I am well aware of the irony of trump being considered positive here, so I’m going to remove it):

tidy_count_pos <- tidy_sentiment %>%
  filter(value > 1,
         !clean_word %in% c("climate", "change", "global", "warm", "trump")) %>%
  count(clean_word) %>%
  arrange(-n)


wordcloud(tidy_count_pos$clean_word, tidy_count_pos$n, max.words = 100)

Embedding Text Data

To go further in the analysis of text data, we need to use a text embedding. This converts the text to a numeric representation in a high dimensional space. The simplest form of this is one-hot encoding, which creates a binary matrix with one column per word, and one row per tweet. If the word occurs in that tweet, then it’s labeled with a 1, and a 0 if not. One hot encoding works well with a small number of words, but scales poorly with richer text.
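A one-hot matrix for three short, made-up tweets can be built in a few lines of base R. This is a minimal sketch, and all names in it are hypothetical.

```r
# Hypothetical tokenized tweets
tweets <- list(c("climate", "change", "real"),
               c("climate", "hoax"),
               c("change", "now"))

# The vocabulary is every distinct word across all tweets
vocab <- sort(unique(unlist(tweets)))

# One row per tweet, one column per word; 1 if the word occurs in the tweet
one_hot <- t(sapply(tweets, function(words) as.integer(vocab %in% words)))
colnames(one_hot) <- vocab
one_hot
```

Even here, three tweets need five columns. With thousands of distinct words the matrix becomes very wide and mostly zeros, which is the scaling problem mentioned above.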

Embeddings are more complex representations of text, usually created by analyzing which words are likely to occur in similar contexts. This has a lot of similarities to principal component analysis for numeric data, in which complex data can be represented by a small number of components that capture correlations between the variables. For text, this means that the embeddings for ‘dog’ and ‘cat’ will be similar, but those for ‘dog’ and ‘car’ will be dissimilar. This can then be used to explore the similarity between pieces of text, or (as we’ll see below) to use text in machine learning models. These embeddings are a key part of large language models (e.g. ChatGPT), where they are used to relate prompts or questions to the appropriate text that makes up a response.

While it’s possible to create your own embedding (which is useful for specific projects), this can be quite time consuming, and can require a substantial amount of text. In the example we’ll use below, we’ll use an embedding that was created using a model called Word2Vec and trained using Google news articles. You can download the file that contains the embedding weights from the Google Drive folder:

Google Drive Folder

A good selection of alternative, pre-trained embeddings can be found at Hugging Face:

Hugging Face

Load the word2vec package, which we’ll use to find the embeddings for different pieces of text, and then read in the embeddings file:

library(word2vec)

model = read.word2vec(file = "../data/twitter/GoogleNews-vectors-negative300.bin")

As an example, here is the embedding for the word ‘cat’ (I’ve just printed the first 50 values):

predict(model, "cat", type = "embedding")[1, 1:50]

##  [1]  0.012329102  0.204101562 -0.285156250  0.216796875  0.118164062
##  [6]  0.083007812  0.049804688 -0.009521484  0.220703125 -0.125976562
## [11]  0.080566406 -0.585937500 -0.004455566 -0.296875000 -0.013122559
## [16] -0.083496094  0.050537109  0.151367188 -0.449218750 -0.013549805
## [21]  0.214843750 -0.147460938  0.224609375 -0.125000000 -0.097167969
## [26]  0.249023438 -0.289062500  0.365234375  0.412109375 -0.085937500
## [31] -0.078613281 -0.197265625 -0.090820312 -0.141601562 -0.102539062
## [36]  0.130859375 -0.003463745  0.072265625  0.044189453  0.345703125
## [41]  0.074707031 -0.112304688  0.067382812  0.112304688  0.019775391
## [46] -0.123535156  0.209960938 -0.072265625 -0.027832031  0.055419922

It’s pretty meaningless to us mortals, but this is a representation of the word ‘cat’ that a computer can work with. To follow the example given above, we can extract these for ‘cat’, ‘dog’ and ‘car’, and explore the correlations between them.

cat_wv = predict(model, "cat", type = "embedding")[1, ]
car_wv = predict(model, "car", type = "embedding")[1, ]
dog_wv = predict(model, "dog", type = "embedding")[1, ]

# highly correlated
cor(cat_wv, dog_wv)

## [1] 0.7607611

# less correlated
cor(car_wv, dog_wv)

## [1] 0.3075351

#visual representation 
plot(cat_wv, dog_wv, xlab = "cat", ylab = "dog")

plot(cat_wv, car_wv, xlab = "cat", ylab = "car")
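The lab uses cor() to compare word vectors. Another measure often used with embeddings is cosine similarity, which is the cosine of the angle between two vectors. The sketch below is not part of the lab code: cosine_sim() is a hypothetical helper, and the short vectors are made up to stand in for the real 300-dimensional embeddings.

```r
# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical 4-dimensional vectors standing in for real 300-dimensional embeddings
v_cat <- c(0.9, 0.8, 0.1, 0.0)
v_dog <- c(0.8, 0.9, 0.2, 0.1)
v_car <- c(0.1, 0.0, 0.9, 0.8)

cosine_sim(v_cat, v_dog)  # vectors pointing in similar directions
cosine_sim(v_cat, v_car)  # vectors pointing in different directions
```

The same function could be applied to cat_wv, dog_wv and car_wv from the code above.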

Vectorized Embeddings

vectorized_words = predict(model, tidy_dat$clean_word, 
                           type = "embedding")

The result (vectorized_words) is a numeric matrix with 300 columns and one row per cleaned word. Words that are missing from the embedding vocabulary appear to come back as rows of NA, which is why the code below calls drop_na(). We’ll now collapse the values into mean embeddings for each tweet. To do this we have to add (and subsequently remove) the tweet id from the cleaned data.

# Cleaning Steps

vectorized_words = as.data.frame(vectorized_words)
vectorized_words$id = tidy_dat$tweetid

vectorized_docs <- vectorized_words %>% 
  drop_na() %>%
  group_by(id) %>% 
  summarise_all(mean, na.rm = TRUE) %>% 
  select(-id)


str(vectorized_docs)

## tibble [18,807 × 300] (S3: tbl_df/tbl/data.frame)
##  $ V1  : num [1:18807] 0.069449 0.12027 -0.039856 -0.000741 0.135742 ...
##  $ V2  : num [1:18807] 0.00402 0.08704 0.14403 0.0993 0.01526 ...
##  $ V3  : num [1:18807] 0.002511 0.000448 -0.006409 0.023615 -0.10791 ...
##  $ V4  : num [1:18807] 0.0628 0.1448 0.1461 0.0753 0.0866 ...
##  $ V5  : num [1:18807] -0.2254 -0.1473 -0.0703 -0.0958 -0.2291 ...
##  $ V6  : num [1:18807] 0.00878 -0.12399 -0.05118 -0.0883 0.05034 ...
##  $ V7  : num [1:18807] 0.0551 -0.0297 0.0404 -0.0152 0.0649 ...
##  $ V8  : num [1:18807] -0.0784 -0.0306 -0.0867 -0.0907 -0.0605 ...
##  $ V9  : num [1:18807] 0.1739 0.1174 0.0393 0.0378 0.154 ...
##  $ V10 : num [1:18807] -0.0105 0.1181 0.1462 0.0787 0.1376 ...
##  $ V11 : num [1:18807] -0.01046 -0.00273 0.01135 -0.07455 -0.06138 ...
##  $ V12 : num [1:18807] -0.0423 -0.1243 -0.1106 -0.073 -0.0456 ...
##  $ V13 : num [1:18807] -0.1085 -0.0684 -0.0844 -0.0697 -0.0501 ...
##  $ V14 : num [1:18807] 0.1108 0.055 0.0859 0.0397 0.0733 ...
##  $ V15 : num [1:18807] -0.0655 -0.1134 -0.0443 -0.0819 -0.1125 ...
##  $ V16 : num [1:18807] 0.0503 0.1453 0.1163 0.0523 0.1952 ...
##  $ V17 : num [1:18807] 0.12322 -0.09059 -0.03635 0.00535 0.0521 ...
##  $ V18 : num [1:18807] 0.0642 0.0712 0.1431 0.1327 0.1313 ...
##  $ V19 : num [1:18807] -0.01314 0.09485 0.03541 -0.00358 -0.04517 ...
##  $ V20 : num [1:18807] -0.1177 -0.1774 -0.0812 -0.1014 -0.2564 ...
##  $ V21 : num [1:18807] 0.0448 0.1052 -0.0387 -0.0643 -0.0804 ...
##  $ V22 : num [1:18807] 0.1586 0.1618 0.0813 0.0452 0.0982 ...
##  $ V23 : num [1:18807] 0.0106 0.0518 0.0569 0.0613 0.0337 ...
##  $ V24 : num [1:18807] -0.0284 -0.0103 -0.0309 -0.0232 0.0137 ...
##  $ V25 : num [1:18807] -9.08e-03 -9.25e-05 -4.49e-02 -4.98e-02 9.90e-02 ...
##  $ V26 : num [1:18807] -0.00694 0.11068 0.01202 0.04246 0.12715 ...
##  $ V27 : num [1:18807] -0.0714 -0.0527 -0.029 -0.079 -0.0406 ...
##  $ V28 : num [1:18807] -0.00173 0.09306 0.07904 0.0501 0.10234 ...
##  $ V29 : num [1:18807] 0.0247 0.1368 0.1075 0.0479 0.0742 ...
##  $ V30 : num [1:18807] -0.12961 -0.091 -0.04478 0.00625 0.07089 ...
##  $ V31 : num [1:18807] -0.0234 -0.0966 -0.0291 0.0464 -0.0759 ...
##  $ V32 : num [1:18807] -0.09392 -0.02847 0.00296 -0.02712 -0.09229 ...
##  $ V33 : num [1:18807] -0.1637 -0.0323 -0.0519 -0.0253 -0.0189 ...
##  $ V34 : num [1:18807] 0.04618 0.03705 0.00763 0.0354 0.08174 ...
##  $ V35 : num [1:18807] 0.0443 0.0592 -0.1122 -0.0132 0.1174 ...
##  $ V36 : num [1:18807] 0.00254 -0.05735 -0.05159 -0.03039 -0.0529 ...
##  $ V37 : num [1:18807] -0.0569 0.0515 -0.0719 -0.0427 0.0426 ...
##  $ V38 : num [1:18807] 0.04977 0.16665 -0.04359 0.06809 -0.00879 ...
##  $ V39 : num [1:18807] -0.02428 -0.00155 0.12598 0.09386 0.07783 ...
##  $ V40 : num [1:18807] 0.0183 0.041 0.0541 0.0247 0.0546 ...
##  $ V41 : num [1:18807] 0.0521 -0.028 0.0179 -0.0161 0.0719 ...
##  $ V42 : num [1:18807] -0.00943 0.02698 -0.01935 -0.01079 -0.03101 ...
##  $ V43 : num [1:18807] 0.057 0.0927 0.1541 0.1055 0.0166 ...
##  $ V44 : num [1:18807] 0.00241 0.08394 0.12037 0.03789 0.02969 ...
##  $ V45 : num [1:18807] 0.0167 -0.0331 -0.0134 -0.0937 -0.1115 ...
##  $ V46 : num [1:18807] -0.0898 -0.0512 -0.0899 -0.019 -0.1294 ...
##  $ V47 : num [1:18807] -0.10784 0.02003 0.00916 0.01565 -0.02183 ...
##  $ V48 : num [1:18807] -0.0287 0.0481 0.0244 -0.0542 0.0581 ...
##  $ V49 : num [1:18807] -0.107 -0.146 -0.101 -0.133 -0.119 ...
##  $ V50 : num [1:18807] 0.0804 0.0904 0.0623 0.125 0.1038 ...
##  $ V51 : num [1:18807] -0.07956 0.00354 -0.01279 0.02764 -0.02808 ...
##  $ V52 : num [1:18807] 0.0518 0.1193 0.0366 0.0836 0.0594 ...
##  $ V53 : num [1:18807] -0.0256 -0.0558 0.0549 -0.0428 -0.1232 ...
##  $ V54 : num [1:18807] 0.0563 -0.1274 0.021 -0.0945 -0.0481 ...
##  $ V55 : num [1:18807] -0.0657 -0.12687 0.00824 -0.06747 -0.01807 ...
##  $ V56 : num [1:18807] -0.073451 0.014323 -0.000839 -0.056435 0.039795 ...
##  $ V57 : num [1:18807] -0.1375 -0.1027 -0.0825 -0.115 -0.2191 ...
##  $ V58 : num [1:18807] -0.0871 -0.0172 -0.0953 -0.1347 -0.2182 ...
##  $ V59 : num [1:18807] -0.06832 -0.00553 0.03067 -0.08169 -0.04656 ...
##  $ V60 : num [1:18807] -0.1221 -0.1185 -0.0299 -0.0393 -0.1279 ...
##  $ V61 : num [1:18807] -0.06634 0.00435 -0.05604 -0.06337 -0.06835 ...
##  $ V62 : num [1:18807] 3.49e-02 5.55e-02 7.63e-05 3.04e-02 4.60e-02 ...
##  $ V63 : num [1:18807] -0.08433 -0.07812 -0.02451 -0.15636 0.00823 ...
##  $ V64 : num [1:18807] -0.1131 -0.1088 -0.0892 0.0183 -0.1204 ...
##  $ V65 : num [1:18807] 0.0183 0.0711 0.0756 -0.0532 0.0334 ...
##  $ V66 : num [1:18807] -0.08282 -0.03923 -0.11503 -0.07454 -0.00576 ...
##  $ V67 : num [1:18807] -0.0674 -0.0188 -0.0636 -0.0194 -0.0236 ...
##  $ V68 : num [1:18807] 0.1646 0.0342 0.0545 0.0932 0.2259 ...
##  $ V69 : num [1:18807] 0.000174 -0.027791 -0.009491 0.017306 -0.05835 ...
##  $ V70 : num [1:18807] 0.0943 0.0285 0.0203 0.0585 -0.0642 ...
##  $ V71 : num [1:18807] -0.05246 -0.08122 -0.03394 -0.00352 -0.02686 ...
##  $ V72 : num [1:18807] 0.0825 0.0918 0.0969 0.1173 0.0516 ...
##  $ V73 : num [1:18807] 0.08616 0.00704 0.01393 0.11563 0.06124 ...
##  $ V74 : num [1:18807] 0.000174 0.037059 0.022522 0.024569 0.046121 ...
##  $ V75 : num [1:18807] -0.1041 -0.2201 -0.0367 -0.1118 -0.0871 ...
##  $ V76 : num [1:18807] -0.0931 -0.0707 -0.0795 -0.0666 -0.0725 ...
##  $ V77 : num [1:18807] 0.0691 0.129 0.1004 0.1079 0.1499 ...
##  $ V78 : num [1:18807] 0.0689 0.1582 0.0685 0.1169 0.1241 ...
##  $ V79 : num [1:18807] 0.08981 0.00376 -0.13673 -0.05162 -0.00022 ...
##  $ V80 : num [1:18807] 0.08245 0.07149 -0.00287 -0.01054 0.19546 ...
##  $ V81 : num [1:18807] 0.1687 0.1961 0.0158 0.0544 0.1087 ...
##  $ V82 : num [1:18807] 0.043806 0.014105 -0.028778 -0.000401 0.065332 ...
##  $ V83 : num [1:18807] -0.053471 -0.040732 -0.051445 -0.000025 -0.041528 ...
##  $ V84 : num [1:18807] 0.0271 -0.0393 -0.1495 -0.1289 -0.114 ...
##  $ V85 : num [1:18807] 0.0537 -0.0444 0.0754 0.0621 0.021 ...
##  $ V86 : num [1:18807] 0.000131 0.027903 0.063316 0.065372 -0.105139 ...
##  $ V87 : num [1:18807] -0.1313 -0.0911 -0.0926 -0.113 -0.1747 ...
##  $ V88 : num [1:18807] 0.00237 0.0498 0.0761 0.05342 0.19375 ...
##  $ V89 : num [1:18807] 0.031 0.0937 0.047 0.0116 0.0904 ...
##  $ V90 : num [1:18807] 0.082 0.089 0.1552 0.0595 -0.0166 ...
##  $ V91 : num [1:18807] 0.0752 0.0528 0.0114 0.0735 0.1198 ...
##  $ V92 : num [1:18807] -0.10951 0.00273 -0.00394 -0.08299 -0.04473 ...
##  $ V93 : num [1:18807] -0.0204 -0.1079 -0.0512 -0.061 -0.1051 ...
##  $ V94 : num [1:18807] -0.0825 -0.0845 -0.1245 -0.05 -0.0167 ...
##  $ V95 : num [1:18807] 0.0439 -0.0628 -0.0837 -0.063 -0.029 ...
##  $ V96 : num [1:18807] -0.11565 -0.10425 -0.02109 0.00361 0.02881 ...
##  $ V97 : num [1:18807] 0.0801 0.0621 0.076 0.0299 0.023 ...
##  $ V98 : num [1:18807] 0.0498 -0.0562 0.0639 0.0227 0.0625 ...
##  $ V99 : num [1:18807] 0.0178 0.0731 0.1258 0.0805 0.0858 ...
##   [list output truncated]
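The averaging step is easier to follow on a tiny example. This base R sketch uses a hypothetical table of five word vectors in three dimensions, belonging to two tweets. One word has no embedding (a row of NA). complete.cases() and aggregate() stand in for drop_na() and the group_by()/summarise_all() pair.

```r
# Hypothetical word-level embeddings: 5 words, 3 dimensions, belonging to 2 tweets
word_vecs <- data.frame(V1 = c(0.2, 0.4, NA, 1.0, 0.0),
                        V2 = c(0.0, 0.2, NA, 0.5, 0.5),
                        V3 = c(-0.1, 0.1, NA, 0.3, 0.1),
                        id = c(1, 1, 1, 2, 2))

# Drop words with no embedding (NA rows), then average each dimension within a tweet
complete <- word_vecs[complete.cases(word_vecs), ]
doc_vecs <- aggregate(. ~ id, data = complete, FUN = mean)
doc_vecs
```

Each tweet ends up as a single row with the same number of dimensions as the word vectors, however many words it contained.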

Cluster Analysis

We’ll first use a K-means cluster function to group the tweets into 4 sets.

tweet_km = kmeans(vectorized_docs, centers = 4)
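kmeans() starts from randomly chosen centres, so the cluster numbering (and sometimes the clusters themselves) can change between runs. The sketch below, on hypothetical toy data, shows two options that may help with repeatability: setting a seed and using several random starts via nstart. Neither is used in the lab code above.

```r
set.seed(42)  # k-means starts from random centres, so fix the seed for repeatable clusters

# Hypothetical data: two well-separated blobs of 20 points each in 2 dimensions
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# nstart = 10 reruns the algorithm from 10 random starts and keeps the best fit
toy_km <- kmeans(toy, centers = 2, nstart = 10)

# Number of points assigned to each cluster
table(toy_km$cluster)
```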

We can also visualize the embeddings using other dimension reduction techniques. Here we use UMAP, a non-linear, efficient way of collapsing high-dimensional data to low (usually 2) dimensions.

library(uwot)

## Warning: package 'uwot' was built under R version 4.2.3

The umap() function in uwot carries out dimensionality reduction using the Uniform Manifold Approximation and Projection (UMAP) method (McInnes et al., 2018). The Python reference implementation is at https://github.com/lmcinnes/umap.

viz <- umap(vectorized_docs, n_neighbors = 15, 
            min_dist = 0.001, spread = 4, n_threads = 2)

This can be plotted: each point here represents an individual tweet, and the colors are the clusters we created in the previous step. Note there are quite a lot of outliers that could potentially be removed, and that one cluster is very distinct from the others. This may suggest a group of tweets that deal with a different aspect of climate change. (You could plot the word cloud for these tweets to see if that shows some differences.)

#visualize using ggplot

library(ggplot2)

df <- data.frame(x = viz[, 1], y = viz[, 2],
                 cluster = as.factor(tweet_km$cluster),
                 stringsAsFactors = FALSE)

ggplot(df, aes(x = x, y = y, col = cluster)) +
  geom_point() + theme_void()

Using Embeddings in Machine Learning

As a last step, we’ll briefly look at using these embeddings in a machine learning model. We’ll build a model to try and predict the sentiment of a tweet (positive or negative) from its content. We’ll use a random forest model with the embeddings as features, and the sentiment value as a label. We’ll first need to integrate our embedding data with the sentiment score we generated earlier. First, we’ll remake the average embedding values per tweet, but this time we’ll keep the tweet id.

vectorized_docs_ml <- vectorized_words %>% 
  drop_na() %>%
  group_by(id) %>% 
  summarise_all(mean, na.rm = TRUE)

tidy_sentiment <- tidy_sentiment %>%
  group_by(tweetid) %>%
  summarize(value = mean(value)) %>%
  mutate(sentiment = ifelse(value > 0, 1, 0)) #if value is greater than 0, make it a 1, if not, make it a zero

Then we merge these two datasets together using the tweet id, and remove any columns we do not want to use in the ML model.

vectorized_docs_ml = inner_join(vectorized_docs_ml, 
                                tidy_sentiment, 
                                by = c("id" = "tweetid"))

# remove id and value
# make sentiment a factor
vectorized_docs_ml = vectorized_docs_ml %>%
  select(-id, -value) %>%
  mutate(sentiment = as.factor(sentiment))

Load the random forest packages

library(caret)
library(ranger)

Now form a training and test set (80/20 split):

train_id = createDataPartition(vectorized_docs_ml$sentiment, p = 0.8)
train = vectorized_docs_ml[train_id[[1]], ] 
test = vectorized_docs_ml[-train_id[[1]], ]
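createDataPartition() samples within each class, so the training and test sets keep roughly the same class balance as the full data. The base R sketch below illustrates that idea on hypothetical labels. It is a simplified stand-in, not caret’s implementation.

```r
set.seed(1)

# Hypothetical labels: 60 zeros and 40 ones
y <- factor(rep(c(0, 1), times = c(60, 40)))

# Sample 80% of the row indices within each class, so both classes keep their share
idx_by_class <- split(seq_along(y), y)
train_idx <- unname(unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.8 * length(i)))
})))

table(y[train_idx])   # class counts in the training set
table(y[-train_idx])  # class counts in the test set
```

Both sets keep the original 60/40 ratio of zeros to ones.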

Train the model:

fit_rf = ranger(sentiment ~ ., train)

Predict for the test dataset:

y_pred = predict(fit_rf, test)$prediction

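And get the performance metrics. Note that caret’s confusionMatrix() expects the predicted classes as its first argument and the true classes as its second. The call below passes them the other way round, so in the output the rows labelled Prediction are actually the true labels and the columns labelled Reference are the model’s predictions. Accuracy and kappa are unaffected, but sensitivity is swapped with positive predictive value and specificity with negative predictive value: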

confusionMatrix(test$sentiment, y_pred)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1375  165
##          1  384  719
##                                           
##                Accuracy : 0.7923          
##                  95% CI : (0.7763, 0.8076)
##     No Information Rate : 0.6655          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5605          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7817          
##             Specificity : 0.8133          
##          Pos Pred Value : 0.8929          
##          Neg Pred Value : 0.6519          
##              Prevalence : 0.6655          
##          Detection Rate : 0.5202          
##    Detection Prevalence : 0.5827          
##       Balanced Accuracy : 0.7975          
##                                           
##        'Positive' Class : 0               
##
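Some of these numbers can be reproduced by hand from the four counts in the table. In the call above, test$sentiment was passed first, so the rows of the printed table hold the true labels and the columns hold the predictions. The sketch below rebuilds the table on that basis. The names cm, recall_0 and precision_0 are my own.

```r
# Counts copied from the confusion matrix printed above.
# Rows are the true labels and columns are the model predictions,
# because test$sentiment was passed as the first argument.
cm <- matrix(c(1375, 384, 165, 719), nrow = 2,
             dimnames = list(truth = c("0", "1"), predicted = c("0", "1")))

# Share of all tweets that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Of the tweets that truly are class 0, the share predicted as 0
recall_0 <- cm["0", "0"] / sum(cm["0", ])

# Of the tweets predicted as class 0, the share that truly are 0
precision_0 <- cm["0", "0"] / sum(cm[, "0"])

round(c(accuracy = accuracy, recall_0 = recall_0, precision_0 = precision_0), 4)
```

The accuracy matches the printed value. The recall for class 0 equals the value printed as Pos Pred Value, and the precision equals the value printed as Sensitivity, which reflects the argument order in the call.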

Natural Language Processing Example

David Leydet

2024-09-30