This notebook walks you through sentiment analysis of tweets in R using VADER.
Install tidyverse and vader if you do not
have them in your R environment.
tidyverse is a collection of R packages for data science.
vader (Valence Aware Dictionary and sEntiment Reasoner)
is a rule-based sentiment analysis tool specifically attuned to social
media text.
# uncomment and run the lines below if you need to install these packages
# install.packages("tidyverse")
# install.packages("vader")
Load packages.
library(tidyverse)
library(vader)
df_tweets <- read_csv("Ikea-tweets.csv")
## Rows: 349 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): username, text
## dbl (1): id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_tweets %>% head()
Print out the number of rows.
nrow(df_tweets)
## [1] 349
To analyze a piece of text using VADER, use the
get_vader() function. Here is an example using one of the
tweets.
The compound score is the “overall” score between -1 (most
extreme negative) and +1 (most extreme positive).
pos, neg, and neu are the proportions of the text that fall in
each category. These should add up to 1.
get_vader("Got my BeautyBase merch I am happy I got the first batch 🥰")
## word_scores
## "{0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0}"
## compound
## "0.572"
## pos
## "0.236"
## neu
## "0.764"
## neg
## "0"
## but_count
## "0"
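Every value in the output above is printed in quotes, which means the result is a named character vector and the scores must be converted with as.numeric() before any arithmetic. As a small sketch, the vector below is typed in by hand from the output shown above (it is a stand-in, not a live call to get_vader()), and the last line checks that pos, neu, and neg add up to 1.

```r
# Values copied by hand from the get_vader() output shown above;
# this vector is a stand-in, not a live call to the vader package.
res <- c(compound = "0.572", pos = "0.236", neu = "0.764",
         neg = "0", but_count = "0")

# Scores come back as strings, so convert before doing arithmetic.
# as.numeric() drops the names, so they are restored afterwards.
proportions <- as.numeric(res[c("pos", "neu", "neg")])
names(proportions) <- c("pos", "neu", "neg")

# The three proportions should sum to 1 (allowing for rounding).
isTRUE(all.equal(sum(proportions), 1))
```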
Because we have a DataFrame (tibble) that contains tweet
texts in the text column, we will need to run
get_vader() on each row of the text column. We
can do this by using the lapply() function.
The code below stores the VADER results in a list named
vscores, with one element per tweet. Note that this may take a
while (a few minutes) if you have thousands of tweets.
vscores <- df_tweets$text %>% lapply(get_vader)
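To see what lapply() hands back, here is a minimal sketch that runs without the vader package. toy_score() is a hypothetical stand-in invented for this illustration; it only mimics the shape of a get_vader() result (a named character vector) and does not compute real sentiment.

```r
# toy_score() is a hypothetical stand-in for get_vader(), used only so this
# sketch runs without the vader package. It counts occurrences of "happy"
# and returns a named character vector, the same shape get_vader() returns.
toy_score <- function(text) {
  n_happy <- lengths(regmatches(text, gregexpr("happy", text, fixed = TRUE)))
  c(compound = as.character(min(1, 0.5 * n_happy)), but_count = "0")
}

texts <- c("I am happy", "just a chair", "happy happy happy")

# lapply() calls the function once per element and always returns a list.
toy_results <- lapply(texts, toy_score)

length(toy_results)           # one result per input text
toy_results[[1]]["compound"]  # look up a score by name
```

Because each list element keeps its names, individual scores can be looked up by name rather than by position.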
Extract compound score and positive/neutral/negative percentages into separate columns.
df_tweets <- df_tweets %>% mutate(
  compound = vscores %>% sapply(function(v) { as.numeric(v["compound"]) }),
  pos = vscores %>% sapply(function(v) { as.numeric(v["pos"]) }),
  neu = vscores %>% sapply(function(v) { as.numeric(v["neu"]) }),
  neg = vscores %>% sapply(function(v) { as.numeric(v["neg"]) })
)
df_tweets
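As an alternative to one sapply() call per column, the list of results can be stacked into a matrix and converted in one step. This is only a sketch in base R: toy_vscores is a hand-made list in the same shape as vscores, and the second element's values are made up for illustration.

```r
# Hand-made results in the same shape as get_vader() output (named character
# vectors). The first is copied from the example tweet shown earlier; the
# second is invented for illustration.
toy_vscores <- list(
  c(word_scores = "{0, 0, 0, 0, 0, 0, 2.7, 0, 0, 0, 0, 0, 0}",
    compound = "0.572", pos = "0.236", neu = "0.764", neg = "0",
    but_count = "0"),
  c(word_scores = "{0, -1.9, 0}",
    compound = "-0.34", pos = "0", neu = "0.8", neg = "0.2",
    but_count = "0")
)

# Keep only the numeric fields; word_scores is not a single number.
keep <- c("compound", "pos", "neu", "neg")

# Stack the selected fields into a character matrix, one row per text.
score_matrix <- do.call(rbind, lapply(toy_vscores, function(v) v[keep]))

# Convert the whole matrix from character to numeric, keeping column names.
storage.mode(score_matrix) <- "double"
score_df <- as.data.frame(score_matrix)
score_df
```

The resulting data frame has one numeric column per score and could be attached to the tweets with bind_cols(), assuming the rows are still in the same order as the text column.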
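# 50 tweets with the highest compound scores (most positive first)
df_tweets %>%
  arrange(desc(compound)) %>%
  select(text, username, compound, pos, neu, neg) %>%
  head(50)
# 50 tweets with the lowest compound scores (most negative first)
df_tweets %>%
  arrange(compound) %>%
  select(text, username, compound, pos, neu, neg) %>%
  head(50)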
mean(df_tweets$compound)
## [1] 0.2370143
The box plot below gives you the spread and skewness of compound scores.
df_tweets %>%
  ggplot(aes(x = compound)) +
  theme_classic() +
  geom_boxplot()
The histogram below gives you an approximate distribution of the compound scores.
df_tweets %>%
  ggplot(aes(x = compound)) +
  theme_classic() +
  geom_histogram(bins = 20, color = "black", fill = "white")