Load Packages
# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext",
"tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "sf",
"tmap", "here", "sentimentr", "gt", "htmlwidgets")
# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
1. Objective
Analyze the sentiment of Reddit users towards electric vehicles (EVs).
2. Search Reddit threads
In this analysis, I use the following keywords to search for relevant threads on Reddit: EV, electric vehicles, electric cars, tesla, and byd. The last two keywords are major EV brands, and including them helps increase the number of data points. I have not restricted the search to any specific subreddit, as discussions about electric vehicles can span a wide range of topics, including environmental concerns, technological advancements, automated vehicles, and more. This approach is intended to capture public sentiment across diverse contexts.
Download Threads
# List of keywords
keywords <- c('ev', 'electric vehicles', 'electric cars', 'tesla', 'byd')
thread_list <- list()
for (keyword in keywords) {
  urls <- find_thread_urls(keyword = keyword,
                           subreddit = "all",
                           sort_by = "relevance",
                           period = "all") %>%
    drop_na()
  rownames(urls) <- NULL
  thread_list[[keyword]] <- urls
}
all_threads <- unique(do.call(rbind, thread_list))
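The deduplication step can be illustrated on toy data. `unique()` on a data frame only drops rows that are identical in every column, so a thread returned by two keyword searches with, say, a different comment count would survive. The sketch below uses hypothetical stand-in data (the titles and URLs are made up) and deduplicates on the `url` column instead; whether that stricter rule is needed for the real data is an assumption that would have to be checked.

```r
# Toy stand-ins for the per-keyword search results (hypothetical data)
thread_list_toy <- list(
  ev    = data.frame(title = c("A", "B"), url = c("u1", "u2"),
                     stringsAsFactors = FALSE),
  tesla = data.frame(title = c("B", "C"), url = c("u2", "u3"),
                     stringsAsFactors = FALSE)
)

# Stack the per-keyword tables, then keep the first row seen for each URL
stacked <- do.call(rbind, thread_list_toy)
deduped <- stacked[!duplicated(stacked$url), ]
rownames(deduped) <- NULL

print(deduped)
```

Thread "B" appears under both keywords but is kept only once.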
Download comment data
# get individual comments
threads_content <- get_thread_content(all_threads$url)
Save data to RData
save(all_threads, threads_content, keywords, file = "ev_threads_comments.RData")
Load data back
load("ev_threads_comments.RData")
3. Data Cleaning
# Load stop words and define regular expression
data("stop_words")
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"
# Stop word removal and tokenization
# Comments could not be downloaded for some threads, so the thread-level data uses the full thread list.
thread_clean <- all_threads %>%
mutate(text = paste(title, ". ", text)) %>% # concatenate title with text
mutate(text = str_replace_all(text, replace_reg, "")) %>%
mutate(date = date_utc)
thread_comment_clean <- threads_content$comments %>%
mutate(text = str_replace_all(comment, replace_reg, ""))
# Combine all threads and comments
thread_all_clean <- bind_rows(
thread_clean %>%
select(text, date),
thread_comment_clean %>%
select(text, date)
)
## Thread topic only
thread_topic_tokens <- thread_clean %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords)
# Tokenize all text
thread_tokens <- thread_all_clean %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords)
save(thread_clean, thread_all_clean, thread_topic_tokens, thread_tokens, file = "thread_token.RData")
load("thread_token.RData")
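To see what the cleaning pattern does, here is a minimal base R sketch on a made-up string. It assumes the pattern is meant to strip URLs and the HTML entities `&amp;`, `&lt;`, and `&gt;`; the sample text is hypothetical. Base R's `gsub()` is used with `perl = TRUE` so that `\\d` inside the character class is read as a digit, which is how `stringr` reads it in the pipeline above.

```r
# Pattern assumed for illustration: URLs plus the entities &amp;, &lt;, &gt;
replace_reg_demo <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Hypothetical raw text
raw <- "Range test &amp; results: https://example.com/ev/1 (200 &gt; 150)"

# Remove matches, then collapse the leftover runs of whitespace
no_markup <- gsub(replace_reg_demo, "", raw, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", no_markup))

print(cleaned)
```

Note that the URL part of the pattern stops at the first character outside its class (for example `?`, `-`, or `_`), so query strings and relative links can leave residue in the tokens.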
4. Word Cloud
I made two word clouds, one from the threads alone (title + text) and one from the threads together with their comments, to compare the two levels of detail.
# Set color to ease the visual analysis
n <- 20
h <- runif(n, 0, 1)
s <- runif(n, 0.6, 1)
v <- runif(n, 0.3, 0.7)
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
# Thread only wordcloud
thread_wordcloud <- thread_topic_tokens %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
# Threads and Comments
comments_wordcloud <- thread_tokens %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
saveWidget(thread_wordcloud, "thread_wordcloud.html", selfcontained = TRUE)
saveWidget(comments_wordcloud, "comments_wordcloud.html", selfcontained = TRUE)
Threads Word Cloud
Threads and Comments Word Cloud
Common words such as electric, cars, and vehicles are not particularly useful here, because the whole dataset is about electric vehicles. They are not search keywords, so the earlier keyword filter did not remove them. I exclude them below, along with evs, to improve the relevance of the results.
filter_words <- c("electric", "car", "cars", "vehicle", "vehicles", "evs")
thread_wordcloud_filtered <- thread_topic_tokens %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
comments_wordcloud_filtered <- thread_tokens %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = 0,
maxRotation = 0,
ellipticity = 0.8)
saveWidget(thread_wordcloud_filtered, "thread_wordcloud_filtered.html", selfcontained = TRUE)
saveWidget(comments_wordcloud_filtered, "comments_wordcloud_filtered.html", selfcontained = TRUE)
Threads Word Cloud
Threads and Comments Word Cloud
When analyzing thread messages alone, the most common words are China, sales, model, percent, battery, market, Elon, and Musk. However, when including comments, the most frequently occurring word is people, followed by battery, time, charging, gas, money, drive, power, market, and miles, among others. This shift highlights a broader range of practical and user-oriented discussions in the comments, reflecting concerns and experiences related to electric vehicles.
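The comparison between the two vocabularies can also be made explicit rather than read off the word clouds. The following base R sketch uses made-up token vectors (not the real data) and a hypothetical helper, `top_words()`, to list which top words are shared and which appear only once comments are included.

```r
# Hypothetical helper: the n most frequent tokens in a character vector
top_words <- function(tokens, n = 3) {
  counts <- sort(table(tokens), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# Toy token vectors standing in for thread_topic_tokens$word and thread_tokens$word
thread_toy <- c("china", "china", "china", "sales", "sales", "model")
all_toy <- c("people", "people", "people", "people",
             "battery", "battery", "battery",
             "china", "china", "charging")

top_thread <- top_words(thread_toy)
top_all <- top_words(all_toy)

shared <- intersect(top_all, top_thread)       # top words in both sets
comment_only <- setdiff(top_all, top_thread)   # top words that only appear with comments

print(shared)
print(comment_only)
```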
5. Tri-gram Analysis
Extract tri-grams from text data.
# Get tri-grams of threads
tri_grams <- thread_clean %>%
select(text) %>%
unnest_tokens(output = three_words,
input = text,
token = "ngrams",
n = 3)
# Get tri-grams of threads + comments
tri_grams_all <- thread_all_clean %>%
select(text) %>%
unnest_tokens(output = three_words,
input = text,
token = "ngrams",
n = 3)
save(tri_grams, tri_grams_all, file="tri_grams.RData")
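For intuition, a tri-gram is simply every run of three consecutive words. The sketch below builds them with a sliding window in base R. It uses a deliberately simplified tokenizer (lowercase, split on non-alphanumeric characters), so it is not identical to what `unnest_tokens()` does internally; `make_trigrams()` is a hypothetical helper name.

```r
# Hypothetical helper: sliding-window tri-grams from a single string
make_trigrams <- function(text) {
  # Simplified tokenizer: lowercase, split on anything that is not a letter or digit
  words <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 3) return(character(0))
  idx <- seq_len(length(words) - 2)
  paste(words[idx], words[idx + 1], words[idx + 2])
}

trigrams_demo <- make_trigrams("The EV tax credit expires soon")
print(trigrams_demo)
```

A six-word sentence yields four overlapping tri-grams, which is why stop-word filtering afterwards removes so many rows.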
Remove tri-grams containing stop words or non-alphabetic terms.
load("tri_grams.RData")
# Separate each tri-gram into three columns, one per word
words_ngram <- tri_grams %>%
  separate(three_words, c("word1", "word2", "word3"), sep = " ")
# Drop tri-grams in which any of the three words is a stop word
words_ngram_filtered <- words_ngram %>%
  # drop stop words
  filter(!word1 %in% stop_words$word &
           !word2 %in% stop_words$word &
           !word3 %in% stop_words$word) %>%
  # keep only words that contain at least one lowercase letter
  filter(str_detect(word1, "[a-z]")
         & str_detect(word2, "[a-z]")
         & str_detect(word3, "[a-z]"))
# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)
words_ngram_filtered %<>%
filter(stri_enc_isascii(word1)
& stri_enc_isascii(word2)
& stri_enc_isascii(word3))
# Sort counts:
words_counts <- words_ngram_filtered %>%
count(word1, word2, word3) %>%
arrange(desc(n))
head(words_counts, 20) %>%
knitr::kable()
word1 | word2 | word3 | n |
---|---|---|---|
ev | tax | credit | 11 |
chinese | electric | vehicles | 8 |
electric | car | sales | 7 |
electric | vehicle | sales | 7 |
electric | car | batteries | 6 |
ceo | elon | musk | 5 |
comprei | um | byd | 5 |
electric | vehicle | battery | 5 |
electric | vehicle | maker | 5 |
ev | maker | byd | 5 |
billion | pay | package | 4 |
byd | dolphin | mini | 4 |
chevrolet | blazer | ev | 4 |
electric | vehicle | batteries | 4 |
hybrid | electric | vehicles | 4 |
lithium | ion | batteries | 4 |
previously | repaired | model | 4 |
price | reducing | factors | 4 |
repaired | model | variants | 4 |
sell | electric | vehicles | 4 |
The most frequent tri-gram is ev tax credit, referring to the financial incentives for purchasing electric vehicles as a way to reduce carbon footprints. Other top tri-grams include Chinese electric vehicles, electric car/vehicle sales, electric car batteries, and Elon Musk, CEO of Tesla. Since our dataset also includes comments, let’s further analyze the tri-grams found in both threads and comments to identify patterns in user discussions.
# Separate each tri-gram into three columns, one per word
words_ngram_all <- tri_grams_all %>%
  separate(three_words, c("word1", "word2", "word3"), sep = " ")
# Drop tri-grams in which any of the three words is a stop word
words_ngram_all_filtered <- words_ngram_all %>%
  # drop stop words
  filter(!word1 %in% stop_words$word &
           !word2 %in% stop_words$word &
           !word3 %in% stop_words$word) %>%
  # keep only words that contain at least one lowercase letter
  filter(str_detect(word1, "[a-z]")
         & str_detect(word2, "[a-z]")
         & str_detect(word3, "[a-z]"))
# Filter out words that are not encoded in ASCII
# (applied to the threads + comments tri-grams, not the thread-only ones)
words_ngram_all_filtered %<>%
  filter(stri_enc_isascii(word1)
         & stri_enc_isascii(word2)
         & stri_enc_isascii(word3))
# Sort counts:
words_counts_all <- words_ngram_all_filtered %>%
count(word1, word2, word3) %>%
arrange(desc(n))
head(words_counts_all, 20) %>%
knitr::kable()
word1 | word2 | word3 | n |
---|---|---|---|
subreddit | message | compose | 223 |
internal | combustion | engine | 177 |
lithium | ion | batteries | 152 |
img | emote | t5_2th52 | 119 |
internal | combustion | engines | 110 |
hydrogen | fuel | cell | 106 |
gas | powered | cars | 101 |
hydrogen | fuel | cells | 99 |
ev | tax | credit | 97 |
tow | truck | driver | 90 |
ev | charging | stations | 84 |
fossil | fuel | industry | 76 |
lithium | ion | battery | 69 |
gas | powered | vehicles | 67 |
gas | powered | car | 66 |
rare | earth | metals | 65 |
nuclear | power | plants | 64 |
burning | fossil | fuels | 62 |
upper | middle | class | 62 |
lead | acid | batteries | 60 |
Once comments are included, the top tri-gram (subreddit message compose) is unrelated to the topic and can be disregarded, as can img emote t5_2th52; both look like markup residue rather than discussion. The remaining tri-grams are more meaningful: internal combustion engine, lithium-ion batteries, hydrogen fuel cell, and gas-powered cars all relate to power sources for vehicles. The data also surfaces broader topics such as the fossil fuel industry, rare earth metals, and nuclear power plants, which are often discussed in the context of energy and sustainability.
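Off-topic tri-grams like these could also be dropped programmatically instead of by eye. The sketch below copies three rows from the table above and removes any tri-gram that contains a word from a blocklist. The blocklist itself is a hypothetical choice made for illustration and would need tuning against the full data.

```r
# Three rows copied from the threads + comments tri-gram table
counts_demo <- data.frame(
  word1 = c("subreddit", "internal", "img"),
  word2 = c("message", "combustion", "emote"),
  word3 = c("compose", "engine", "t5_2th52"),
  n = c(223, 177, 119),
  stringsAsFactors = FALSE
)

# Hypothetical blocklist of markup-like words
boilerplate_words <- c("subreddit", "compose", "img", "emote")

# Flag a tri-gram if any of its three words is on the blocklist
has_boilerplate <- counts_demo$word1 %in% boilerplate_words |
  counts_demo$word2 %in% boilerplate_words |
  counts_demo$word3 %in% boilerplate_words

topical <- counts_demo[!has_boilerplate, ]
print(topical)
```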
6. Sentiment Analysis
I calculated the sentiment scores using the sentimentr package and created separate dataframes for positive and negative sentiments. This separation facilitates clearer and more intuitive visualization of the sentiment distribution.
sentiment_scores <- sentiment_by(thread_clean$text)
## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory. It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.
thread_clean$sentiment <- sentiment_scores$ave_sentiment
# Filter positive sentiment texts
positive_texts <- thread_clean %>%
filter(sentiment > 0) %>%
arrange(desc(sentiment))
# Filter negative sentiment texts
negative_texts <- thread_clean %>%
filter(sentiment < 0) %>%
arrange(sentiment)
7. Display 10 sample texts
sentiment_example_dict <- bind_rows(
head(positive_texts %>% select(text, sentiment), 5),
head(negative_texts %>% select(text, sentiment), 5)
)
sentiment_example_dict %>%
gt() %>%
tab_header(
title = "10 Reddit Threads by Sentiment") %>%
cols_label(
text = "Text",
sentiment = "Sentiment Score",
) %>%
tab_style(
style = cell_text(weight = "bold", color = "black"),
locations = cells_body(columns = c(text))
) %>%
opt_table_lines()
10 Reddit Threads by Sentiment
Text | Sentiment Score |
---|---|
EVs are cleaner than gas cars, but a growing share of Americans don't believe it . | 0.8843312 |
Electric vehicle batteries are getting cheaper much faster than we expected . | 0.8683527 |
Yes, electric vehicles really are better than fossil fuel burners: As the Nobel prize committee eloquently put it: Lithium-ion batteries... have laid the foundation of a wireless, fossil fuel-free society, and are of the greatest benefit to humankind. . | 0.8033862 |
A cool guide to EV trucks right now [oc] . | 0.7166667 |
Trump wishes electric car supporters 'rot in hell' in Truth Social Christmas message . | 0.6933752 |
Four Dead In Fire As Tesla Doors Fail To Open After Crash . I can bear the front manual release, but the rear is totally unacceptable, dangerous and pure stupidity. | -0.7046454 |
MOjAnG nEvEr LiStEns, sToP cOmPlAinINg . | -0.6260990 |
Trump says he'd rather be electrocuted than eaten by a shark on bizarre rant apparently meant as an indictment of electric vehicles . | -0.6076220 |
Boycott Tesla ads to air during Super Bowl Tesla dances away from liability in Autopilot crashes by pointing to a note buried deep in the owners manual, that says Autopilot is only safe on freeways. . | -0.6050000 |
Ford Delays $12 Billion in EV Investments Due To UAW Strike Impact and Slow Consumer Demand . | -0.5938574 |
The results of the sentiment analysis using the dictionary method are debatable. For instance, the sentence “Trump wishes electric car supporters ‘rot in hell’ in Truth Social Christmas message”, which appears highly negative, is assigned a positive sentiment score. Additionally, the range of sentiment scores does not provide a meaningful relative value for comparison between threads. For example, the sentence “EVs are cleaner than gas cars, but a growing share of Americans don’t believe it” is assigned a higher sentiment score than “Electric vehicle batteries are getting cheaper much faster than we expected”. However, the latter conveys a more positive tone in its context. This discrepancy highlights the limitations of the dictionary-based approach in capturing nuanced sentiment.
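A toy example shows how a word-level dictionary can end up with a positive score for a hostile headline. The scorer below is deliberately naive and is not the algorithm used by `sentimentr`; the lexicon and its weights are hypothetical values chosen only to illustrate the mechanism, namely that individually positive words can outweigh a negative phrase.

```r
# Hypothetical lexicon with made-up polarity weights
toy_lexicon <- c(wishes = 0.5, supporters = 0.5, christmas = 0.5,
                 rot = -0.5, hell = -0.5)

# Naive scorer: sum the weights of every lexicon word found in the text
score_text <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  matched <- words[words %in% names(lexicon)]
  sum(lexicon[matched])
}

headline <- "Trump wishes electric car supporters 'rot in hell' in Truth Social Christmas message"
toy_score <- score_text(headline, toy_lexicon)
print(toy_score)
```

Three mildly positive words outweigh the two negative ones, so the total is above zero even though the sentence as a whole reads as negative.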
To improve the accuracy of the sentiment analysis, I used a pre-trained BERT model, as implemented in the provided Python script, to analyze the Reddit threads.
write_csv(thread_clean, "thread_only.csv")
thread_only_bert <- read_csv("thread_only_bert.csv")
## New names:
## • `` -> `...1`
## Rows: 1132 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): title, text, subreddit, url, bert_label
## dbl  (6): ...1, timestamp, comments, year, sentiment, bert_score
## date (1): date_utc
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# drop NAs
thread_only_bert %<>% drop_na('bert_label')
# See 10 threads
bind_rows(
head(thread_only_bert %>%
filter(bert_label %in% c('1 star', '2 stars'))
, 4),
head(thread_only_bert %>%
filter(bert_label %in% c('4 stars', '5 stars'))
, 4),
head(thread_only_bert %>%
filter(bert_label %in% c('3 stars'))
, 2)
) %>%
select(text, bert_label) %>%
gt() %>%
tab_header(
title = "10 Reddit Threads by BERT Label") %>%
cols_label(
text = "Text",
bert_label = "Bert Label",
) %>%
tab_style(
style = cell_text(weight = "bold", color = "black"),
locations = cells_body(columns = c(text))
) %>%
opt_table_lines()
10 Reddit Threads by BERT Label
Text | Bert Label |
---|---|
Lawsuit accuses Tesla of exaggerating driving range to boost EV sales . | 1 star |
Tariffs Aren't Going To Stop China's Affordable BYD EVs From Marching On Europe . | 1 star |
oMg i CaN't BeLiEvE iT . | 1 star |
Ford Is Getting Dealers to Agree to No-Haggle EV Sales . | 1 star |
DeLorean is back for the future with brand new EV . | 5 stars |
This guy parking in an EV Charging spot . | 5 stars |
Tesla Superchargers Finally Open To General Motors' EVs . | 5 stars |
Best EV Truck Around? I seem to think so! . Best EV around? I sure do love it! One of the best EVs around. Im a bit biased, but hard to beat 440 (460-470) range miles. 10k towing, air suspension and cool tech. Looks are subjective, but it sure does get a lot of attention - more so when I park at a Tesla SC station. Happy to own it. | 5 stars |
Tesla misses insurance firms Safest Cars list because its EVs dont crash often enough . | 3 stars |
Wife is set on an EV, which one to lease? . Her commute is about 80miles a day. If I had to choose one today it would be an i5 m60. But is that worth the cost over and i4? Are the i7 and iX worth it over the i5? | 3 stars |
From the BERT model’s results, one example rated as 5 stars, “This guy parking in an EV Charging spot”, does not appear particularly positive. However, the rest of the examples seem reasonable and align well with the intended sentiment. Compared to the dictionary-based method, I find the BERT model more effective in capturing nuanced sentiment, making it the preferred choice for this analysis.
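The star labels are grouped throughout this report as negative (1 to 2 stars), neutral (3 stars), and positive (4 to 5 stars). A minimal base R sketch of that mapping, using a made-up label vector, might look like this:

```r
# Example labels in the same format as the bert_label column
labels_demo <- c("1 star", "2 stars", "3 stars", "4 stars", "5 stars")

# The first character of each label is the star count
stars <- as.integer(substr(labels_demo, 1, 1))

# Intervals (0,2], (2,3], (3,5] give negative, neutral, positive
sentiment_class <- cut(stars,
                       breaks = c(0, 2, 3, 5),
                       labels = c("negative", "neutral", "positive"))

print(data.frame(label = labels_demo, stars = stars, class = sentiment_class))
```

Keeping the mapping in one place avoids repeating the `%in% c(...)` filters for each group.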
8. Insights
# Extract year from the date column
thread_clean$year <- substr(thread_clean$date, 1, 4)
# Calculate median sentiment by year
median_sentiment <- thread_clean %>%
group_by(year) %>%
summarise(median_sentiment = median(sentiment))
# Create the plot
ggplot(thread_clean, aes(x = year, y = sentiment)) +
geom_jitter(width = 0.2, alpha = 0.6, color = "gray") +
geom_point(data = median_sentiment, aes(x = year, y = median_sentiment),
color = "red", size = 3, shape = 18) +
geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
labs(title = "Sentiment by Year with Median Values from Dictionary Method",
x = "Year",
y = "Sentiment") +
theme_minimal(base_size = 14)
The plot of sentiment distribution over the years reveals two key observations. First, the number of threads discussing electric vehicles has increased significantly over the past 10 years, suggesting growing public interest in the topic. Second, the sentiment distribution appears balanced between positive and negative sentiments from 2020 onwards. However, we cannot draw definitive conclusions from this observation due to the limitations of the sentiment scoring methods, as noted in the earlier analysis.
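Because a median near zero can hide a wide spread, it may help to report the share of positive threads next to it. The sketch below does both with base R `aggregate()` on made-up scores (the values are hypothetical, not taken from the data).

```r
# Hypothetical sentiment scores for two years
toy_scores <- data.frame(
  year = c("2022", "2022", "2022", "2023", "2023"),
  sentiment = c(-0.4, 0.1, 0.3, -0.2, 0.2),
  stringsAsFactors = FALSE
)

# Median score per year
median_by_year <- aggregate(sentiment ~ year, data = toy_scores, FUN = median)

# Share of threads with a positive score per year
share_positive <- aggregate(sentiment ~ year, data = toy_scores,
                            FUN = function(x) mean(x > 0))

print(median_by_year)
print(share_positive)
```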
Next, consider the results from the BERT model.
# Extract year from the date column
thread_only_bert$year <- year(thread_only_bert$date_utc)
thread_only_bert %>%
ggplot(aes(x = year, fill = bert_label)) +
geom_bar(position = 'fill') +
scale_x_continuous(breaks = seq(min(thread_only_bert$year),
max(thread_only_bert$year),
by = 1)) +
scale_fill_brewer(palette = 'PuRd', direction = -1) +
labs(title = "Proportion of Sentiment by BERT Label over Time") +
theme_minimal()
The distribution of sentiment based on the BERT model reveals changes in sentiment towards electric vehicles over time. In 2015, sentiment was predominantly positive. However, between 2016 and 2018, sentiment shifted towards being more negative. Starting in 2019, sentiment gradually became more positive again. From 2020 onwards, the distribution appears balanced between positive and negative sentiments, indicating a more neutral stance or mixed opinions in recent years.
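The proportions that a filled bar chart displays can be computed directly with `table()` and `prop.table()`. The sketch below uses made-up labels for two years. One caveat worth keeping in mind is that proportions hide the number of threads behind them, so years with very few threads can swing sharply; printing the raw counts alongside helps judge that.

```r
# Hypothetical labels for two years
toy_labels <- data.frame(
  year = c(2015, 2015, 2016, 2016, 2016, 2016),
  bert_label = c("5 stars", "4 stars", "1 star", "1 star", "2 stars", "5 stars"),
  stringsAsFactors = FALSE
)

# Counts per year and label, then row-wise proportions (each year sums to 1)
label_counts <- table(toy_labels$year, toy_labels$bert_label)
label_shares <- prop.table(label_counts, margin = 1)

print(label_counts)
print(round(label_shares, 2))
```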
Next, see which words are most common in threads with positive and negative sentiment.
# Wordcloud with a custom color palette
n <- 20
h <- runif(n, 0, 1)
s <- runif(n, 0.6, 1)
v <- runif(n, 0.3, 0.7)
df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))
positive_word_cloud <- positive_texts %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords) %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = -pi/6,
maxRotation = -pi/6,
rotateRatio = 1)
negative_word_cloud <- negative_texts %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords) %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = -pi/6,
maxRotation = -pi/6,
rotateRatio = 1)
saveWidget(positive_word_cloud, "positive_word_cloud.html", selfcontained = TRUE)
saveWidget(negative_word_cloud, "negative_word_cloud.html", selfcontained = TRUE)
Positive Sentiment Word Cloud
Negative Sentiment Word Cloud
From the word clouds of positive and negative threads, two words are common to both: China and sales. Words frequently associated with positive sentiment include de (most likely the Portuguese or Spanish preposition from non-English threads, as the tri-gram comprei um byd suggests, rather than a meaningful term), model, percent, range, market, and vehicle brands like BMW and Ford. Interestingly, names like Elon Musk and Trump are more prevalent in threads with negative sentiment, suggesting that these figures are polarizing topics in the discussion.
Let’s take a look at the positive and negative sentiment from the BERT model.
# Create Numeric BERT label
thread_only_bert %<>%
mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>%
as.numeric())
# Make Positive and Negative sentiment dataframes
thread_pos_bert <- thread_only_bert %>%
filter(bert_label_numeric %in% c(4, 5))
thread_neg_bert <- thread_only_bert %>%
filter(bert_label_numeric %in% c(1, 2))
bert_positive_word_cloud <- thread_pos_bert %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords) %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = -pi/6,
maxRotation = -pi/6,
rotateRatio = 1)
bert_negative_word_cloud <- thread_neg_bert %>%
unnest_tokens(word, text, token='words') %>%
# remove stop words
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "[a-z]")) %>%
filter(!word %in% keywords) %>%
filter(!word %in% filter_words) %>%
count(word, sort=TRUE) %>%
wordcloud2(color = pal,
minRotation = -pi/6,
maxRotation = -pi/6,
rotateRatio = 1)
saveWidget(bert_positive_word_cloud, "bert_positive_word_cloud.html", selfcontained = TRUE)
saveWidget(bert_negative_word_cloud, "bert_negative_word_cloud.html", selfcontained = TRUE)
[BERT] Positive Sentiment Word Cloud
[BERT] Negative Sentiment Word Cloud
The results of common words in positive and negative sentiments using the BERT model are not significantly different from those obtained with the dictionary-based method at an overall level. However, the BERT model captures more meaningful words, such as battery, charge, range, and sales in positive sentiments. It also highlights polarizing topics like Elon Musk, Trump, and China in negative sentiments, providing deeper insights into the context and themes of discussions.
Conclusion
In summary, this analysis of Reddit threads and comments from the past 10 years reveals growing public interest in electric vehicles, as evidenced by the increasing number of discussions. While the dictionary-based method provided a general understanding, the BERT model offered more nuanced and meaningful insights, capturing specific themes such as battery, charging, and range in positive sentiments, and polarizing topics like Elon Musk, Trump, and China in negative sentiments. The analysis also highlights a shift in sentiment over time, from positive in 2015 to more negative in the 2016-2018 period, and then balancing out in recent years. These findings underscore the evolving perception of electric vehicles and the value of models like BERT for deeper and more accurate sentiment analysis.