Twitter Sentiment Analysis

🧭 Overview & Setup

This notebook walks you through:

How to run a sentiment analysis on tweets using the VADER package
How to interpret the VADER scores
How to view top positive/negative tweets
How to visualize the compound scores distribution

📦 Install and load packages

Install tidyverse and vader if you do not have them in your R environment.

tidyverse is a collection of R packages for data science
vader (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool specifically attuned to social media text

# uncomment and run the lines below if you need to install these packages
# install.packages("tidyverse")
# install.packages("vader")
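If you would rather not edit comments by hand, the check can be scripted. The sketch below is one possible approach, not part of the original notebook: `missing_packages()` is a hypothetical helper that uses base R's `requireNamespace()` to report which packages are not yet installed.

```r
# Hypothetical helper (not from the notebook): return the names of packages
# that cannot be found in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Which of the two packages used in this notebook still need installing?
to_install <- missing_packages(c("tidyverse", "vader"))
print(to_install)

# Uncomment to install whatever is missing.
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out, matching the lines above, so nothing is downloaded unless you choose to run it.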

Load packages.

library(tidyverse)
library(vader)

📃 Read CSV file

df_tweets <- read_csv("Ikea-tweets.csv")

## Rows: 349 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): username, text
## dbl (1): id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df_tweets %>% head()

Print out the number of rows.

nrow(df_tweets)

## [1] 349

🔮 Sentiment Analysis

📌 Sample usage

To analyze a piece of text using VADER, use the get_vader() function. Here is an example using one of the tweets.

The compound score is the overall score, ranging from -1 (most negative) to +1 (most positive).
pos, neg, and neu are the proportions of the text that fall into each category. They add up to 1.
We are mainly interested in the compound score.

get_vader("Got my BeautyBase merch I am happy I got the first batch 🥰")

##                                 word_scores 
## "{0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0}" 
##                                    compound 
##                                     "0.572" 
##                                         pos 
##                                     "0.236" 
##                                         neu 
##                                     "0.764" 
##                                         neg 
##                                         "0" 
##                                   but_count 
##                                         "0"
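Note that the values in this output are quoted: the result is a named character vector, so the scores need converting before any arithmetic. The sketch below reuses the values printed above (copied by hand, so it runs without the vader package), checks that the three proportions sum to 1, and labels the compound score. The 0.05 cutoff is a commonly used convention for VADER rather than anything stated in this notebook, and `label_sentiment()` is a hypothetical helper.

```r
# Values copied from the get_vader() output above, kept as strings
# because that is how they are printed there.
res <- c(compound = "0.572", pos = "0.236", neu = "0.764", neg = "0")

# Convert to numeric while keeping the names.
scores <- setNames(as.numeric(res), names(res))

# pos, neu and neg are proportions, so they should total 1
# (all.equal() allows for floating-point rounding).
prop_total <- sum(scores[c("pos", "neu", "neg")])
print(isTRUE(all.equal(prop_total, 1)))

# Hypothetical helper: label a compound score using an assumed cutoff.
label_sentiment <- function(compound, cutoff = 0.05) {
  ifelse(compound >= cutoff, "positive",
         ifelse(compound <= -cutoff, "negative", "neutral"))
}

print(label_sentiment(scores[["compound"]]))
```

Adjust `cutoff` if your analysis calls for a wider or narrower neutral band.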

🧮 Calculate scores for all tweets

Because the tweets are stored in the text column of a data frame (tibble), we need to run get_vader() on each element of that column. The lapply() function does this.

The code below stores the VADER results in a list named vscores. Note that this may take a few minutes if you have thousands of tweets.

vscores <- df_tweets$text %>% lapply(get_vader)

Extract the compound score and the positive/neutral/negative proportions into separate columns.

df_tweets <- df_tweets %>% mutate(
  compound = vscores %>% sapply(function(v) { as.numeric(v["compound"]) }),
  pos = vscores %>% sapply(function(v) { as.numeric(v["pos"]) }),
  neu = vscores %>% sapply(function(v) { as.numeric(v["neu"]) }),
  neg = vscores %>% sapply(function(v) { as.numeric(v["neg"]) }),
)
df_tweets
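The four `sapply()` calls above repeat the same pattern, so they could also be collapsed into a single pass. The following is a sketch of that idea in base R, not the notebook's own method. `vscores_demo` is a made-up stand-in with the same shape as `vscores`: the first element copies the sample output shown earlier, and the second is invented purely for illustration.

```r
# Stand-in for vscores: a list of named character vectors.
vscores_demo <- list(
  c(word_scores = "{0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0}",
    compound = "0.572", pos = "0.236", neu = "0.764", neg = "0",
    but_count = "0"),
  # Invented values, for illustration only.
  c(word_scores = "{0, -1.5, 0}",
    compound = "-0.34", pos = "0", neu = "0.7", neg = "0.3",
    but_count = "0")
)

score_names <- c("compound", "pos", "neu", "neg")

# Pull the four scores from every element at once. sapply() returns one
# column per tweet, so transpose to get one row per tweet.
score_matrix <- t(sapply(vscores_demo, function(v) as.numeric(v[score_names])))
colnames(score_matrix) <- score_names

score_df <- as.data.frame(score_matrix)
print(score_df)
```

With the real data, a data frame built this way from `vscores` could be attached to the original tibble with dplyr's `bind_cols()`.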

👍 50 most positive tweets

df_tweets %>% 
  arrange(desc(compound)) %>% 
  select(text, username, compound, pos, neu, neg) %>%
  head(50)

👎 50 most negative tweets

df_tweets %>% 
  arrange(compound) %>% 
  select(text, username, compound, pos, neu, neg) %>%
  head(50)

⚖️ Average compound score

mean(df_tweets$compound)

## [1] 0.2370143

📦 Box plot of compound scores

The box plot below shows the spread and skewness of the compound scores.

df_tweets %>% 
  ggplot(aes(x=compound)) + 
  theme_classic() +
  geom_boxplot()
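The numbers behind a box plot can also be read off directly. As a rough sketch using base R only, `quantile()` gives the quartiles and `IQR()` the distance between the first and third quartiles. `compound_demo` is a small made-up vector standing in for `df_tweets$compound`.

```r
# Made-up compound scores, for illustration only.
compound_demo <- c(-0.6, -0.1, 0, 0, 0.2, 0.3, 0.45, 0.57, 0.8)

# First quartile, median, and third quartile.
quartiles <- quantile(compound_demo, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Interquartile range: the distance between the first and third quartiles.
box_width <- IQR(compound_demo)
print(box_width)

# Rough heuristic: a median above the mean hints at a longer tail
# on the negative side.
print(median(compound_demo) > mean(compound_demo))
```

To try this on the real scores, replace `compound_demo` with `df_tweets$compound`.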

📊 Histogram of compound scores

The histogram below gives you an approximate distribution of the compound scores.

df_tweets %>% 
  ggplot(aes(x=compound)) + 
  theme_classic() +
  geom_histogram(bins = 20, color="black", fill="white")