Exploring Sentiments Towards Electric Vehicles: A Reddit Analysis

Thanawit Suwannikom

2024-11-29

Load Packages

# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "httr", "tidytext", 
              "tidyverse", "igraph", "ggraph", "wordcloud2", "textdata", "sf", 
              "tmap", "here", "sentimentr", "gt", "htmlwidgets")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

1. Objective

Analyze the sentiment of Reddit users towards electric vehicles (EVs).

2. Search Reddit threads

In this analysis, I use the following keywords to search for relevant threads on Reddit: EV, electric vehicles, electric cars, tesla, and byd. The last two are major EV brands, and including them increases the number of data points. I have not restricted the search to any specific subreddit, because discussions about electric vehicles span a wide range of topics, including environmental concerns, technological advancements, automated vehicles, and more. Searching across all subreddits gives a broader view of public sentiment across these contexts.

Download Threads

# List of search keywords
keywords <- c('ev', 'electric vehicles', 'electric cars', 'tesla', 'byd')

thread_list <- list()

for (keyword in keywords) {
  urls <- find_thread_urls(keyword = keyword, 
                           subreddit = "all", 
                           sort_by = "relevance", 
                           period = "all") %>%
    drop_na()

  rownames(urls) <- NULL
  
  thread_list[[keyword]] <- urls
}

all_threads <- unique(do.call(rbind, thread_list))
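Because the same thread can match more than one keyword, the stacked results need deduplication. The sketch below uses two made-up result tables (the titles and example.com URLs are placeholders, not real search results) to show what `do.call(rbind, ...)` followed by `unique()` does, alongside a hypothetical alternative that deduplicates on the URL alone.

```r
# Toy stand-ins for two per-keyword result tables (all values are made up).
hits_ev <- data.frame(
  title = c("EV tax credit explained", "BYD overtakes rivals"),
  url   = c("https://example.com/t/1", "https://example.com/t/2"),
  stringsAsFactors = FALSE
)
hits_byd <- data.frame(
  title = c("BYD overtakes rivals", "BYD Dolphin review"),
  url   = c("https://example.com/t/2", "https://example.com/t/3"),
  stringsAsFactors = FALSE
)
toy_list <- list(ev = hits_ev, byd = hits_byd)

# Stack the tables; rbind() on a named list yields row names such as "ev.1".
stacked <- do.call(rbind, toy_list)

# unique() compares row contents (not row names), so the repeated thread is dropped.
deduped_rows <- unique(stacked)

# Hypothetical alternative: treat the URL as the thread identifier.
deduped_url <- stacked[!duplicated(stacked$url), ]
rownames(deduped_url) <- NULL

c(stacked = nrow(stacked), by_row = nrow(deduped_rows), by_url = nrow(deduped_url))
```

Row-wise `unique()` only removes rows that agree in every column. If any column could differ between two retrievals of the same thread, deduplicating on the URL may be the safer choice.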

Download comment data

# get individual comments
threads_content <- get_thread_content(all_threads$url)

Save data to RData

save(all_threads, threads_content, keywords, file = "ev_threads_comments.RData")

Load data back

load("ev_threads_comments.RData")

3. Data Cleaning

# Load stop words and define regular expression
data("stop_words")
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

# Stop word removal and tokenization

# Comments could not be downloaded for some threads, so thread-level data uses the full thread list.
thread_clean <- all_threads %>%
  mutate(text = paste(title, ". ", text)) %>% # concatenate title with text
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  mutate(date = date_utc)

thread_comment_clean <- threads_content$comments %>%
  mutate(text = str_replace_all(comment, replace_reg, ""))

# Combine all threads and comments
thread_all_clean <- bind_rows(
  thread_clean %>%
    select(text, date),
  thread_comment_clean %>%
    select(text, date)
)

## Thread topic only
thread_topic_tokens <- thread_clean %>%
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords)

# Tokenize all text
thread_tokens <- thread_all_clean %>%
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords)

save(thread_clean, thread_all_clean, thread_topic_tokens, thread_tokens, file="thread_token.RData")

load("thread_token.RData")
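To see what the `replace_reg` pattern defined at the start of this section removes, the sketch below applies it to two made-up strings. It uses base R `gsub()` with `perl = TRUE` as a stand-in for `str_replace_all()`; I am assuming the two regex engines agree on this pattern. The second pattern, `alt_reg`, is a hypothetical alternative and not part of the original analysis.

```r
# Same pattern as replace_reg in the cleaning step.
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

sample_text <- c(
  "Range test results: https://example.com/posts/123 &amp; more",
  "Charger map http://example.com/map?city=nyc&amp;zoom=3"
)

# perl = TRUE so that \d inside the bracket expression is read as a digit class.
cleaned <- gsub(replace_reg, "", sample_text, perl = TRUE)
print(cleaned)

# The URL part of the pattern stops at characters such as "?", "=", "-" and "_",
# so a query string can survive the cleaning.
# Hypothetical alternative: drop everything up to the next whitespace.
alt_reg <- "https?://\\S+|&amp;|&lt;|&gt;"
cleaned_alt <- gsub(alt_reg, "", sample_text, perl = TRUE)
print(cleaned_alt)
```

With the original pattern, the second string keeps the fragment after the `?`, which later tokenizes into stray words such as `city` and `nyc`. Whether this matters depends on how many URLs in the data carry query strings or hyphens.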

4. Word Cloud

I made two word clouds, one from the threads alone (title + text) and one from threads plus comments, to compare what each level of discussion emphasizes.

# Set color to ease the visual analysis
n <- 20
h <- runif(n, 0, 1)
s <- runif(n, 0.6, 1)
v <- runif(n, 0.3, 0.7)

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

# Thread only wordcloud
thread_wordcloud <- thread_topic_tokens %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)


# Threads and Comments
comments_wordcloud <- thread_tokens %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

saveWidget(thread_wordcloud, "thread_wordcloud.html", selfcontained = TRUE)

saveWidget(comments_wordcloud, "comments_wordcloud.html", selfcontained = TRUE)

Threads Word Cloud

Threads and Comments Word Cloud

Generic words such as electric, cars, and vehicles dominate both clouds but add little, since the whole dataset is about electric vehicles. They were not among the search keywords, so the keyword filter above did not remove them. I exclude them here to make the remaining words more informative.

filter_words <- c("electric", "car", "cars", "vehicle", "vehicles", "evs")

thread_wordcloud_filtered <- thread_topic_tokens %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

comments_wordcloud_filtered <- thread_tokens %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

saveWidget(thread_wordcloud_filtered, "thread_wordcloud_filtered.html", selfcontained = TRUE)

saveWidget(comments_wordcloud_filtered, "comments_wordcloud_filtered.html", selfcontained = TRUE)

Threads Word Cloud

Threads and Comments Word Cloud

When analyzing thread messages alone, the most common words are China, sales, model, percent, battery, market, Elon, and Musk. However, when including comments, the most frequently occurring word is people, followed by battery, time, charging, gas, money, drive, power, market, and miles, among others. This shift highlights a broader range of practical and user-oriented discussions in the comments, reflecting concerns and experiences related to electric vehicles.

5. Tri-gram Analysis

Extract tri-grams from text data.

# Get tri-grams of threads
tri_grams <- thread_clean %>%
  select(text) %>%
  unnest_tokens(output = three_words,
                input = text,
                token = "ngrams",
                n = 3)

# Get tri-grams of threads + comments
tri_grams_all <- thread_all_clean %>%
  select(text) %>%
  unnest_tokens(output = three_words,
                input = text,
                token = "ngrams",
                n = 3)

save(tri_grams, tri_grams_all, file="tri_grams.RData")
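The sliding-window idea behind tri-gram extraction, and the effect of dropping any tri-gram that contains a stop word, can be sketched in base R. `make_ngrams()` and `toy_stop` are hypothetical names introduced for illustration, and the tokenizer here is much simpler than the one tidytext uses, so results on real text may differ.

```r
# Hypothetical helper (not part of tidytext): build word n-grams from one string.
make_ngrams <- function(text, n = 3) {
  # Lowercase, then split on anything that is not a letter, digit, or apostrophe.
  words <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  # Slide a window of width n across the word vector.
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

grams <- make_ngrams("The EV tax credit expires soon")
print(grams)

# Keep only tri-grams in which none of the three words is a stop word.
toy_stop <- c("the", "soon")  # tiny made-up stop list
parts <- strsplit(grams, " ")
keep <- vapply(parts, function(p) !any(p %in% toy_stop), logical(1))
print(grams[keep])
```

Six words give four overlapping tri-grams, and a single stop word in any position removes the whole tri-gram. That is why the filtered counts are much smaller than the raw tri-gram table.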

Remove tri-grams containing stop words or non-alphabetic terms.

load("tri_grams.RData")

# Separate each tri-gram into three columns
words_ngram <- tri_grams %>%
  separate(three_words, c("word1", "word2", "word3"), sep = " ")

# Drop rows where any of the three words is a stop word
words_ngram_filtered <- words_ngram %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & 
           !word2 %in% stop_words$word & 
           !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") 
         & str_detect(word2, "[a-z]")
         & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCII, google 'ASCII table'
library(stringi)

words_ngram_filtered %<>% 
  filter(stri_enc_isascii(word1) 
         & stri_enc_isascii(word2) 
         & stri_enc_isascii(word3))

# Sort counts:
words_counts <- words_ngram_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()

word1	word2	word3	n
ev	tax	credit	11
chinese	electric	vehicles	8
electric	car	sales	7
electric	vehicle	sales	7
electric	car	batteries	6
ceo	elon	musk	5
comprei	um	byd	5
electric	vehicle	battery	5
electric	vehicle	maker	5
ev	maker	byd	5
billion	pay	package	4
byd	dolphin	mini	4
chevrolet	blazer	ev	4
electric	vehicle	batteries	4
hybrid	electric	vehicles	4
lithium	ion	batteries	4
previously	repaired	model	4
price	reducing	factors	4
repaired	model	variants	4
sell	electric	vehicles	4

The most frequent tri-gram is ev tax credit, referring to the financial incentives for purchasing electric vehicles as a way to reduce carbon footprints. Other top tri-grams include Chinese electric vehicles, electric car/vehicle sales, electric car batteries, and Elon Musk, CEO of Tesla. Since our dataset also includes comments, let’s further analyze the tri-grams found in both threads and comments to identify patterns in user discussions.

# Separate each tri-gram into three columns
words_ngram_all <- tri_grams_all %>%
  separate(three_words, c("word1", "word2", "word3"), sep = " ")

# Drop rows where any of the three words is a stop word
words_ngram_all_filtered <- words_ngram_all %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & 
           !word2 %in% stop_words$word & 
           !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") 
         & str_detect(word2, "[a-z]")
         & str_detect(word3, "[a-z]"))

# Filter out words that are not encoded in ASCII
words_ngram_all_filtered %<>% 
  filter(stri_enc_isascii(word1) 
         & stri_enc_isascii(word2) 
         & stri_enc_isascii(word3))

# Sort counts:
words_counts_all <- words_ngram_all_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts_all, 20) %>% 
  knitr::kable()

word1	word2	word3	n
subreddit	message	compose	223
internal	combustion	engine	177
lithium	ion	batteries	152
img	emote	t5_2th52	119
internal	combustion	engines	110
hydrogen	fuel	cell	106
gas	powered	cars	101
hydrogen	fuel	cells	99
ev	tax	credit	97
tow	truck	driver	90
ev	charging	stations	84
fossil	fuel	industry	76
lithium	ion	battery	69
gas	powered	vehicles	67
gas	powered	car	66
rare	earth	metals	65
nuclear	power	plants	64
burning	fossil	fuels	62
upper	middle	class	62
lead	acid	batteries	60

Once comments are included, the top tri-gram, subreddit message compose, can be disregarded: it is unrelated to the topic and appears to come from boilerplate links in automated messages rather than from user discussion. The tri-gram img emote t5_2th52 looks like similar markup residue. The remaining tri-grams are more informative: internal combustion engine, lithium-ion batteries, hydrogen fuel cell, and gas-powered cars all relate to vehicle power sources. The data also surfaces broader energy and sustainability topics such as the fossil fuel industry, rare earth metals, and nuclear power plants.
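One way to set such off-topic tri-grams aside before reading the table is to filter on a small blocklist. The sketch below reuses the first four rows of the table above; the `boilerplate_tokens` vector is a hypothetical choice made for illustration and was not part of the original analysis.

```r
# First four rows of the threads + comments tri-gram table shown above.
toy_counts <- data.frame(
  word1 = c("subreddit", "internal", "lithium", "img"),
  word2 = c("message", "combustion", "ion", "emote"),
  word3 = c("compose", "engine", "batteries", "t5_2th52"),
  n     = c(223, 177, 152, 119),
  stringsAsFactors = FALSE
)

# Hypothetical blocklist of tokens that signal markup or moderation boilerplate.
boilerplate_tokens <- c("subreddit", "compose", "img", "emote")

# Flag a tri-gram if any of its three words is on the blocklist.
is_boilerplate <- with(
  toy_counts,
  word1 %in% boilerplate_tokens |
    word2 %in% boilerplate_tokens |
    word3 %in% boilerplate_tokens
)

topical_counts <- toy_counts[!is_boilerplate, ]
print(topical_counts)
```

A blocklist like this is blunt: a token such as subreddit could also occur in genuine discussion, so the list would need checking against the data before being applied at scale.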

6. Sentiment Analysis

I calculated the sentiment scores using the sentimentr package and created separate dataframes for positive and negative sentiments. This separation facilitates clearer and more intuitive visualization of the sentiment distribution.

sentiment_scores <- sentiment_by(thread_clean$text)

## Warning: Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
## raw `character` vector is passed to `text.var`. This may be costly of time and
## memory.  It is highly recommended that the user first runs the raw `character`
## vector through the `get_sentences` function.

thread_clean$sentiment <- sentiment_scores$ave_sentiment

# Filter positive sentiment texts
positive_texts <- thread_clean %>%
  filter(sentiment > 0) %>%
  arrange(desc(sentiment))

# Filter negative sentiment texts
negative_texts <- thread_clean %>%
  filter(sentiment < 0) %>%
  arrange(sentiment)

7. Display 10 sample texts

sentiment_example_dict <- bind_rows(
  head(positive_texts %>% select(text, sentiment), 5),
  head(negative_texts %>% select(text, sentiment), 5)
)


sentiment_example_dict %>%
  gt() %>%
  tab_header(
    title = "10 Reddit Threads by Sentiment") %>%
  cols_label(
    text = "Text",
    sentiment = "Sentiment Score",
  ) %>%
  tab_style(
    style = cell_text(weight = "bold", color = "black"),
    locations = cells_body(columns = c(text))
  ) %>%
  opt_table_lines()

10 Reddit Threads by Sentiment
Text	Sentiment Score
EVs are cleaner than gas cars, but a growing share of Americans don't believe it .	0.8843312
Electric vehicle batteries are getting cheaper much faster than we expected .	0.8683527
Yes, electric vehicles really are better than fossil fuel burners: As the Nobel prize committee eloquently put it: Lithium-ion batteries... have laid the foundation of a wireless, fossil fuel-free society, and are of the greatest benefit to humankind. .	0.8033862
A cool guide to EV trucks right now [oc] .	0.7166667
Trump wishes electric car supporters 'rot in hell' in Truth Social Christmas message .	0.6933752
Four Dead In Fire As Tesla Doors Fail To Open After Crash . I can bear the front manual release, but the rear is totally unacceptable, dangerous and pure stupidity.	-0.7046454
MOjAnG nEvEr LiStEns, sToP cOmPlAinINg .	-0.6260990
Trump says he'd rather be electrocuted than eaten by a shark on bizarre rant apparently meant as an indictment of electric vehicles .	-0.6076220
Boycott Tesla ads to air during Super Bowl Tesla dances away from liability in Autopilot crashes by pointing to a note buried deep in the owners manual, that says Autopilot is only safe on freeways. .	-0.6050000
Ford Delays $12 Billion in EV Investments Due To UAW Strike Impact and Slow Consumer Demand .	-0.5938574

The results of the sentiment analysis using the dictionary method are debatable. For instance, the sentence “Trump wishes electric car supporters ‘rot in hell’ in Truth Social Christmas message”, which appears highly negative, is assigned a positive sentiment score. Additionally, the range of sentiment scores does not provide a meaningful relative value for comparison between threads. For example, the sentence “EVs are cleaner than gas cars, but a growing share of Americans don’t believe it” is assigned a higher sentiment score than “Electric vehicle batteries are getting cheaper much faster than we expected”. However, the latter conveys a more positive tone in its context. This discrepancy highlights the limitations of the dictionary-based approach in capturing nuanced sentiment.
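The sketch below illustrates, with a toy word-lookup scorer, how a hostile headline can still receive a positive total. The lexicon weights are made up and are not sentimentr's values, `toy_score()` is a hypothetical function, and I have not checked which words sentimentr actually matched for this headline.

```r
# Made-up lexicon: the weights are illustrative and are NOT sentimentr's values.
toy_lexicon <- c(wishes = 0.5, christmas = 0.6, supporters = 0.4,
                 rot = -0.5, hell = -0.75)

# Hypothetical scorer: look up each word, sum the matched weights, and scale by
# the square root of the word count (a choice made for this sketch).
toy_score <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  hits <- lexicon[words[words %in% names(lexicon)]]
  sum(hits) / sqrt(length(words))
}

headline <- paste("Trump wishes electric car supporters 'rot in hell'",
                  "in Truth Social Christmas message")
score <- toy_score(headline, toy_lexicon)
print(score)
```

Under these assumed weights the positive words slightly outweigh the negative ones, so the total is above zero even though the headline reports an insult. A word-level lookup has no way to know that "wishes" here introduces a curse rather than a kind sentiment.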

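To improve the accuracy of the sentiment analysis, I used a pre-trained BERT model, as implemented in the provided Python script, to analyze the Reddit threads. The threads are exported to CSV, scored by the script, and read back with the BERT labels attached.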

write_csv(thread_clean, "thread_only.csv")

thread_only_bert <- read_csv("thread_only_bert.csv")

## New names:
## Rows: 1132 Columns: 12
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (5): title, text, subreddit, url, bert_label dbl (6): ...1, timestamp,
## comments, year, sentiment, bert_score date (1): date_utc
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`

# drop NAs
thread_only_bert %<>% drop_na('bert_label')

# See 10 threads
bind_rows(
  head(thread_only_bert %>% 
         filter(bert_label %in% c('1 star', '2 stars'))
         , 4),
  head(thread_only_bert %>% 
         filter(bert_label %in% c('4 stars', '5 stars'))
         , 4),
  head(thread_only_bert %>% 
         filter(bert_label %in% c('3 stars'))
         , 2)
) %>%
  select(text, bert_label) %>%
  gt() %>%
  tab_header(
    title = "10 Reddit Threads by BERT Label") %>%
  cols_label(
    text = "Text",
    bert_label = "Bert Label",
  ) %>%
  tab_style(
    style = cell_text(weight = "bold", color = "black"),
    locations = cells_body(columns = c(text))
  ) %>%
  opt_table_lines()

10 Reddit Threads by BERT Label
Text	Bert Label
Lawsuit accuses Tesla of exaggerating driving range to boost EV sales .	1 star
Tariffs Aren't Going To Stop China's Affordable BYD EVs From Marching On Europe .	1 star
oMg i CaN't BeLiEvE iT .	1 star
Ford Is Getting Dealers to Agree to No-Haggle EV Sales .	1 star
DeLorean is back for the future with brand new EV .	5 stars
This guy parking in an EV Charging spot .	5 stars
Tesla Superchargers Finally Open To General Motors' EVs .	5 stars
Best EV Truck Around? I seem to think so! . Best EV around? I sure do love it! One of the best EVs around. Im a bit biased, but hard to beat 440 (460-470) range miles. 10k towing, air suspension and cool tech. Looks are subjective, but it sure does get a lot of attention - more so when I park at a Tesla SC station. Happy to own it.	5 stars
Tesla misses insurance firms Safest Cars list because its EVs dont crash often enough .	3 stars
Wife is set on an EV, which one to lease? . Her commute is about 80miles a day. If I had to choose one today it would be an i5 m60. But is that worth the cost over and i4? Are the i7 and iX worth it over the i5?	3 stars

From the BERT model’s results, one example rated as 5 stars, “This guy parking in an EV Charging spot”, does not appear particularly positive. However, the rest of the examples seem reasonable and align well with the intended sentiment. Compared to the dictionary-based method, I find the BERT model more effective in capturing nuanced sentiment, making it the preferred choice for this analysis.

8. Insights

# Extract year from the date column
thread_clean$year <- substr(thread_clean$date, 1, 4)

# Calculate median sentiment by year
median_sentiment <- thread_clean %>%
  group_by(year) %>%
  summarise(median_sentiment = median(sentiment))

# Create the plot
ggplot(thread_clean, aes(x = year, y = sentiment)) +
  geom_jitter(width = 0.2, alpha = 0.6, color = "gray") +
  geom_point(data = median_sentiment, aes(x = year, y = median_sentiment), 
             color = "red", size = 3, shape = 18) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") + 
  labs(title = "Sentiment by Year with Median Values from Dictionary Method",
       x = "Year",
       y = "Sentiment") +
  theme_minimal(base_size = 14)

The plot of sentiment distribution over the years reveals two key observations. First, the number of threads discussing electric vehicles has increased significantly over the past 10 years, suggesting growing public interest in the topic. Second, the sentiment distribution appears balanced between positive and negative sentiments from 2020 onwards. However, we cannot draw definitive conclusions from this observation due to the limitations of the sentiment scoring methods, as noted in the earlier analysis.
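Both observations are read off the plot, so a per-year summary table would make them easier to check. The sketch below uses made-up dates and scores in place of `thread_clean`; the `date` and `sentiment` column names follow the code above, and everything else is illustrative.

```r
# Toy stand-in for thread_clean (dates and scores are made up).
toy_threads <- data.frame(
  date      = c("2015-03-01", "2020-06-15", "2020-11-02",
                "2023-01-20", "2023-08-09", "2023-12-30"),
  sentiment = c(0.40, -0.20, 0.10, 0.30, -0.50, 0.00),
  stringsAsFactors = FALSE
)
toy_threads$year <- substr(toy_threads$date, 1, 4)

# Threads per year, median score, and share of threads with a positive score.
# table() and tapply() both order groups by the sorted year labels.
yearly <- data.frame(
  year      = sort(unique(toy_threads$year)),
  n_threads = as.vector(table(toy_threads$year)),
  median    = as.vector(tapply(toy_threads$sentiment, toy_threads$year, median)),
  share_pos = as.vector(tapply(toy_threads$sentiment > 0, toy_threads$year, mean))
)
print(yearly)
```

The counts describe the threads the search returned, not all Reddit activity, so a rising `n_threads` is suggestive rather than conclusive evidence of growing interest.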

Next, consider the results from the BERT model.

# Extract year from the date column
thread_only_bert$year <- year(thread_only_bert$date_utc)

thread_only_bert %>%
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_x_continuous(breaks = seq(min(thread_only_bert$year),
                                  max(thread_only_bert$year),
                                  by = 1)) +
  scale_fill_brewer(palette = 'PuRd', direction = -1) +
  labs(title = "Proportion of Sentiment by BERT Label over Time") +
  theme_minimal()

The distribution of sentiment based on the BERT model reveals changes in sentiment towards electric vehicles over time. In 2015, sentiment was predominantly positive. However, between 2016 and 2018, sentiment shifted towards being more negative. Starting in 2019, sentiment gradually became more positive again. From 2020 onwards, the distribution appears balanced between positive and negative sentiments, indicating a more neutral stance or mixed opinions in recent years.
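The proportions that the stacked bar chart displays can also be computed directly, which makes year-to-year comparisons less dependent on reading colors. This sketch uses made-up labels in place of `thread_only_bert`. Grouping 1 to 2 stars as negative, 3 as neutral, and 4 to 5 as positive is an assumption that mirrors the grouping used in the examples table.

```r
# Toy stand-in for thread_only_bert (years and labels are made up).
toy_bert <- data.frame(
  year = c(2015, 2015, 2017, 2017, 2017, 2021, 2021, 2021, 2021),
  bert_label = c("5 stars", "4 stars", "1 star", "2 stars", "5 stars",
                 "1 star", "3 stars", "5 stars", "4 stars"),
  stringsAsFactors = FALSE
)

# The first character of the label is the star count.
toy_bert$stars <- as.numeric(substr(toy_bert$bert_label, 1, 1))

# Intervals (0,2], (2,3], (3,5] map to negative, neutral, positive.
toy_bert$polarity <- cut(toy_bert$stars, breaks = c(0, 2, 3, 5),
                         labels = c("negative", "neutral", "positive"))

# Row-wise proportions: each year sums to 1, as in geom_bar(position = "fill").
shares <- prop.table(table(toy_bert$year, toy_bert$polarity), margin = 1)
print(round(shares, 2))
```

Years with few threads produce volatile proportions, so it helps to read the shares next to the raw counts from `table(toy_bert$year)`.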

Next, see which words are most common in positive and negative threads.

# Wordcloud with a custom color palette
n <- 20
h <- runif(n, 0, 1)
s <- runif(n, 0.6, 1) 
v <- runif(n, 0.3, 0.7) 

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 10000))

positive_word_cloud <- positive_texts %>%
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords) %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal,
             minRotation = -pi/6,
             maxRotation = -pi/6,
             rotateRatio = 1)


negative_word_cloud <- negative_texts %>%
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords) %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal,
             minRotation = -pi/6,
             maxRotation = -pi/6,
             rotateRatio = 1)

saveWidget(positive_word_cloud, "positive_word_cloud.html", selfcontained = TRUE)

saveWidget(negative_word_cloud, "negative_word_cloud.html", selfcontained = TRUE)

Positive Sentiment Word Cloud

Negative Sentiment Word Cloud

From the word clouds of positive and negative threads, two common words appear in both sentiments: China and sales. Words frequently associated with positive sentiment include de (probably the Portuguese or Spanish preposition from non-English threads, which the English stop-word list does not remove), model, percent, range, market, and vehicle brands like BMW and Ford. Interestingly, names like Elon Musk and Trump are more prevalent in threads with negative sentiment, suggesting that these figures are polarizing topics in the discussion.

Let’s take a look at the positive and negative sentiment from the BERT model.

# Create Numeric BERT label
thread_only_bert %<>% 
  mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% 
           as.numeric())

# Make Positive and Negative sentiment dataframes
thread_pos_bert <- thread_only_bert %>%
  filter(bert_label_numeric %in% c(4, 5))

thread_neg_bert <- thread_only_bert %>%
  filter(bert_label_numeric %in% c(1, 2))

bert_positive_word_cloud <- thread_pos_bert %>% 
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords) %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal,
             minRotation = -pi/6,
             maxRotation = -pi/6,
             rotateRatio = 1)

bert_negative_word_cloud <- thread_neg_bert %>% 
  unnest_tokens(word, text, token='words') %>%
  # remove stop words
  anti_join(stop_words, by = "word") %>%
  filter(str_detect(word, "[a-z]")) %>%
  filter(!word %in% keywords) %>%
  filter(!word %in% filter_words) %>%
  count(word, sort=TRUE) %>%
  wordcloud2(color = pal,
             minRotation = -pi/6,
             maxRotation = -pi/6,
             rotateRatio = 1)

saveWidget(bert_positive_word_cloud, "bert_positive_word_cloud.html", selfcontained = TRUE)

saveWidget(bert_negative_word_cloud, "bert_negative_word_cloud.html", selfcontained = TRUE)

[BERT] Positive Sentiment Word Cloud

[BERT] Negative Sentiment Word Cloud

The results of common words in positive and negative sentiments using the BERT model are not significantly different from those obtained with the dictionary-based method at an overall level. However, the BERT model captures more meaningful words, such as battery, charge, range, and sales in positive sentiments. It also highlights polarizing topics like Elon Musk, Trump, and China in negative sentiments, providing deeper insights into the context and themes of discussions.

Conclusion

In summary, the sentiment analysis of Reddit threads and comments over the past 10 years reveals growing public interest in electric vehicles, as evidenced by the increasing number of discussions. While dictionary-based sentiment methods provided a general understanding, the BERT model offered more nuanced and meaningful insights, capturing specific themes such as battery, charging, and range in positive sentiments, and polarizing topics like Elon Musk, Trump, and China in negative sentiments. The analysis also highlights a shift in sentiment over time, from positive in 2015 to more negative in the 2016-2018 period, and then balancing out in recent years. These findings underscore the evolving perception of electric vehicles and the importance of leveraging advanced models like BERT for deeper and more accurate sentiment analysis.