# Package names
packages <- c("RedditExtractoR", "anytime", "magrittr", "ggplot2", "dplyr", "tidytext", "tidyverse", "igraph", "ggraph", "tidyr", "wordcloud2", "textdata", "sf", "tmap")

# Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Objective

Changes in sentiment towards Lactate Threshold over the past 7 years.

On December 15, 2017, SweatElite, a group dedicated to examining and sharing the training approaches of elite athletes worldwide, released the first in-depth article on the training methodologies employed by Jakob Ingebrigtsen, a promising young runner from Norway. This was the first detailed article on the ‘Lactate Threshold’ training method I could find online, which matches the time period of the first appearance of the phrase on Reddit as well.

In the article, they mention “Norwegian long-distance runners are on the rise in athletics scene in the last two to three years. Especially the success of the three Ingebritsen brothers and the former European marathon record holder Sondre Moen. Below are some more references from the article for those interested.

During winter Ingrid but also other Norwegian runners are forced to train a lot indoors on a treadmill, or to do cross-country skiing, because of the cold weather conditions there. Ingrid trained in her peak weeks up to 170 to 200 kilometres per week and did a lot threshold training likely the Ingebritsen nowadays. This could vary from 2 times 15min or 3 times 10min in lactate threshold zone to different forms of intervals like 2000m times 5 and so on. In my opinion Ingrid implemented the knowledge of the more scientific advanced training of the cross-country skiers in Norway in her own training.

The success of the Norwegian runners shows that it is possible to compete as a European with the best in the world, regardless of genetics etc. If someone has the talent and the knowledge about the right training, the hard work what it takes and starts early enough like Jakob I., it’s possible to compete at the same level as the east African runners.

Since the publication of this article, the term ‘Lactate Threshold’ has gained popularity and is now utilized by athletes globally, regardless of their level of experience. This project seeks to examine Reddit, an inclusive platform accessible to individuals worldwide, to assess whether the observed trend, a phenomenon I’ve encountered as a Division I runner, can be identified through text and sentiment analysis.

https://www.sweatelite.co/norwegian-distance-running-training-insights-ingebritsen-brothers-etc/

Search Reddit threads using a keyword

threads_1 <- find_thread_urls(keywords = "lactate threshold", 
                              sort_by = 'relevance', 
                              period = 'all')
colnames(threads_1)
head(threads_1)
save(threads_1, file = "/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/threads.RData")

Visualize the data for context.

We’re interested in the change in appearance and sentiment towards Lactate Thresholds throughout the years. Thus, here we show the rise in the discussion of Lactate Threshold training on the Reddit platform.

We see that around February 2022, we start to see a rise in count for the keyword ‘Lactate Threshold’, and it continues to grow today.

load("/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/threads.RData")
# create new column: date
threads_1 %<>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date))


threads_1 %>% 
  ggplot(aes(x = date)) +
  geom_histogram(color="black", position = 'stack', binwidth = 60*60*24*7) +
  stat_density(geom = "line", aes(y = ..scaled..), color = "red") + 
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(threads_1$date, na.rm = TRUE), 
                                max(threads_1$date, na.rm = TRUE), 
                                by = "6 months")) +  # Change '1 month' to '6 months'
  theme_minimal()

write.csv(threads_1, "/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/Reddit.csv", row.names = FALSE)

Clean and tokenize the text data

Before Tokenization

# Tokenization (word tokens)
words <- threads_1 %>% 
  unnest_tokens(output = word, input = text, token = "words")

words %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

# load list of stop words - from the tidytext package
data("stop_words")
# view random 50 words
print(stop_words$word[sample(1:nrow(stop_words), 50)])
##  [1] "concerning" "is"         "be"         "greatest"   "with"      
##  [6] "things"     "anywhere"   "second"     "placed"     "furthers"  
## [11] "pointing"   "other"      "g"          "asked"      "toward"    
## [16] "saw"        "consider"   "an"         "young"      "yourselves"
## [21] "whole"      "better"     "turning"    "too"        "every"     
## [26] "back"       "have"       "area"       "everywhere" "there's"   
## [31] "showing"    "together"   "hereby"     "seems"      "facts"     
## [36] "seemed"     "do"         "indicate"   "of"         "herself"   
## [41] "anyone"     "now"        "whoever"    "everyone"   "wanted"    
## [46] "while"      "our"        "me"         "shouldn't"  "formerly"
# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean <- threads_1 %>% 
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 24692, After: 8809

After Tokenization

words_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

Word cloud

library(dplyr)
n <- 20
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark or bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 1000))

words_clean_except <- words_clean %>%
  filter(!grepl("threshold|lactate", word, ignore.case = TRUE))
  
words_clean_except %>% 
  count(word, sort = TRUE) %>% 
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.8)

Tri-gram analysis

In addition to the evident connections among ‘lactate,’ ‘threshold,’ and ‘test,’ which collectively define the increasingly embraced training method referred to as the ‘lactate threshold test,’ trigrams containing the term ‘heart’ prominently come to the forefront. Within this trigram, ‘heart’ is intricately linked to diverse phrases, including ‘heart rate,’ ‘max heart rate,’ ‘heart rate zones,’ and ‘heart rate monitor.’

Notably, the term ‘pace’ emerges as a pivotal link, connecting the concept of ‘lactate threshold’ with the phrases revolving around the term ‘heart.’ This observation suggests an interplay between physiological thresholds and heart-related metrics in the realm of training methodologies.

One can also observe the prominent words surrounding biotechnologies used to measure heart rate and lactate such as ‘garmin connect app’, and ‘chest strap’.

#get ngrams. You may try playing around with the value of n, n=3 , n=4
words_ngram <- threads_1 %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 10) %>% 
  knitr::kable()
word1 word2 word3 n
lactate threshold test 59
max heart rate 28
heart rate zones 23
lactate threshold pace 16
heart rate monitor 13
guided lactate threshold 9
lactate threshold heart 7
threshold heart rate 7
average heart rate 6
chest heart rate 4
words_counts %>%
  filter(n >= 2) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks",
       x = "", y = "")

Pre and Post February 2022 (Spike in Keyword appearance)

Pre-February 2022

The trigram at the bottom illustrates the frequency of word groupings before the rise in popularity of ‘lactate threshold’ training, especially in endurance sports like running (Pre February 2022). The only strong relations we see amongst words before February 2022 are more simple training concepts such as ‘max heart rate’, ‘traditional tempo’, and ‘lactate threshold’. One can see that the variety of trigrams are limited when compared to the full trigram that spans the 2016-2023 period.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean_pre <- threads_1 %>% 
  filter(date < as.Date("2022-02-01")) %>%
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 24692, After: 8809
words_clean_pre %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

#get ngrams. You may try playing around with the value of n, n=3 , n=4
words_ngram <- threads_1 %>%
  filter(date < as.Date("2022-02-01")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the new bi-gram (n=2) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 10) %>% 
  knitr::kable()
word1 word2 word3 n
lactate threshold test 14
lactate threshold pace 7
heart rate zones 4
max heart rate 4
lactate threshold heart 3
threshold heart rate 3
determining lactate threshold 2
garmin connect app 2
heart rate graph 2
heart rate monitor 2
library(patchwork)
words_counts %>%
  filter(n >= 2) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks n >= 2",
       x = "", y = "")

Post-February 2022

The trigram at the bottom illustrates the frequency of word groupings throughout the rise in popularity of ‘lactate threshold’ training (Post February 2022). Compared to the trigram of pre-February 2022, we can see that the variety of trigrams have increased, encompassing phrases that illustrate more sophisticated training methodologies such as ‘guided lactate threshold’ and ‘blood lactate testing’.

# Regex that matches URL-type string
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

words_clean_post <- threads_1 %>% 
  filter(date >= as.Date("2022-02-01")) %>%
  # drop URLs
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  # Tokenization (word tokens)
  unnest_tokens(word, text, token = "words") %>% 
  # drop stop words
  anti_join(stop_words, by = "word") %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word, "[a-z]"))

# Check the number of rows after removal of the stop words. There should be fewer words now
print(
  glue::glue("Before: {nrow(words)}, After: {nrow(words_clean)}")
)
## Before: 24692, After: 8809
words_clean_post %>%
  count(word, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(x = "words",
       y = "counts",
       title = "Unique wordcounts")

#get ngrams. You may try playing around with the value of n, n=3 , n=4
words_ngram <- threads_1 %>%
  filter(date >= as.Date("2022-02-01")) %>%
  mutate(text = str_replace_all(text, replace_reg, "")) %>%
  select(text) %>%
  unnest_tokens(output = paired_words,
                input = text,
                token = "ngrams",
                n = 3)

words_ngram_pair <- words_ngram %>%
  separate(paired_words, c("word1", "word2", "word3"), sep = " ")

# filter rows where there are stop words under word 1 column and word 2 column
words_ngram_pair_filtered <- words_ngram_pair %>%
  # drop stop words
  filter(!word1 %in% stop_words$word & !word2 %in% stop_words$word & !word3 %in% stop_words$word) %>% 
  # drop non-alphabet-only strings
  filter(str_detect(word1, "[a-z]") & str_detect(word2, "[a-z]"))

# Filter out words that are not encoded in ASCII
# To see what's ASCCII, google 'ASCII table'
library(stringi)
words_ngram_pair_filtered %<>% 
  filter(stri_enc_isascii(word1) & stri_enc_isascii(word2) & stri_enc_isascii(word3))

# Sort the new bi-gram (n=2) counts:
words_counts <- words_ngram_pair_filtered %>%
  count(word1, word2, word3) %>%
  arrange(desc(n))

head(words_counts, 20) %>% 
  knitr::kable()
word1 word2 word3 n
lactate threshold test 45
max heart rate 24
heart rate zones 19
heart rate monitor 11
lactate threshold pace 9
guided lactate threshold 8
average heart rate 5
lactate threshold heart 4
min lt 4 4
threshold heart rate 4
chest heart rate 3
current lactate threshold 3
garmin lactate threshold 3
heart rate zone 3
lactate threshold hr 3
lactate threshold testing 3
min lt week 3
minute lactate threshold 3
rate zones based 3
1080format pngauto webps 2
library(patchwork)
words_counts %>%
  filter(n >= 2) %>%
  graph_from_data_frame() %>% # convert to graph
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = .6, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Networks n >= 2",
       x = "", y = "")

Sentiment analysis (BERT & Dictionary methods)

I first used the BERT method and compared it to the dictionary method to see how the different approaches perform for this specific dataset.

I found that with a correlation coefficient of 0.29, the two methods indicates a mild positive correlation. Examining the scatter plot below, it becomes apparent that threads rated 4-5 stars by the BERT model predominantly fall above 0 in the dictionary method. This positive alignment suggests that both the BERT and dictionary methods are able to identify the sentiments similarly, and that we can confidently proceed with the dictionary method, as outlined in the assignment prompt.

reddit_sentiment <- read_csv('/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/Reddit_bert.csv') %>% 
  drop_na('bert_label')

Get sentiment scores using the dictionary method for comparison.

reddit_sentiment <- reddit_sentiment %>% 
  mutate(title = replace_na(title, ""),
         text = replace_na(text, ""),
         title_text = str_c(title, text, sep = ". "))

library(sentimentr)
reddit_sentiment_dictionary <- sentiment_by(reddit_sentiment$title_text) # by string (a group of sentences)

reddit_sentiment$sentiment_dict <- reddit_sentiment_dictionary %>% pull(ave_sentiment)
reddit_sentiment$word_count <- reddit_sentiment_dictionary %>% pull(word_count)

save(reddit_sentiment, file = "/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/reddit_sentiment.RData")

Check the correlation between the sentiment values from two different methods.

load("/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/reddit_sentiment.RData")
reddit_sentiment <- reddit_sentiment %>% mutate(bert_label_numeric = str_sub(bert_label, 1, 1) %>% as.numeric())

cor(reddit_sentiment$bert_label_numeric, reddit_sentiment$sentiment_dict)
## [1] 0.2933851
ggplot(data = reddit_sentiment, aes(x = bert_label_numeric, y = sentiment_dict)) +
  geom_jitter(width = 0.1, height = 0, color = 'Black') +
  geom_line(aes(y = 0), color = '#FFD700', lwd = 1, linetype='dashed')+
  geom_smooth(method = "lm", se = FALSE, color = "gray")

data("stop_words")
replace_reg <- "http[s]?://[A-Za-z\\d/\\.]+|&amp;|&lt;|&gt;"

reddit_sentiment_clean <- reddit_sentiment %>% 
  mutate(title_text = str_replace_all(title_text, replace_reg, "")) %>%
  unnest_tokens(word, title_text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>% 
  filter(str_detect(word, "[a-z]")) %>% 
  filter(!word %in% c('lactate','threshold'))

We are not interested in words that are commonly seen in both positive and negative threads. We can identify words that are uniquely seen in either positive or negative threads using anti_join.

reddit_sentiment_clean_negative <- reddit_sentiment_clean %>% 
  filter(bert_label_numeric %in% c(1,2))
reddit_sentiment_clean_positive <- reddit_sentiment_clean %>% 
  filter(bert_label_numeric %in% c(4,5))

reddit_sentiment_clean_negative_unique <- reddit_sentiment_clean_negative %>% 
  anti_join(reddit_sentiment_clean_positive, by = 'word')
reddit_sentiment_clean_positive_unique <- reddit_sentiment_clean_positive %>%
  anti_join(reddit_sentiment_clean_negative, by = 'word')
  • Words appearing in negative threads
library(htmlwidgets)
library(webshot2)
devtools::install_github("lchiffon/wordcloud2")

n <- 50
h <- runif(n, 0, 1) # any color
s <- runif(n, 0.6, 1) # vivid
v <- runif(n, 0.3, 0.7) # neither too dark or bright

df_hsv <- data.frame(h = h, s = s, v = v)
pal <- apply(df_hsv, 1, function(x) hsv(x['h'], x['s'], x['v']))
pal <- c(pal, rep("grey", 1000))

reddit_sentiment_clean_negative_unique %>% 
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.7)
# Word cloud for negative sentiment
wordcloud_negative <- reddit_sentiment_clean_negative_unique %>%
  count(word, sort = TRUE)%>%
  head(20)

library(knitr)
kable(as.data.frame(wordcloud_negative[, c("word", "n")]))
word n
reading 12
avg 8
stopped 8
current 7
difficult 7
failed 7
manual 7
ride 7
updated 7
hit 6
left 6
people 6
supposed 6
couldn 5
life 5
cool 4
hoping 4
incorrect 4
issue 4
lap 4
  • Words appearing in positive threads
reddit_sentiment_clean_positive_unique %>% 
  count(word, sort = TRUE) %>%
  wordcloud2(color = pal, 
             minRotation = 0, 
             maxRotation = 0, 
             ellipticity = 0.7)
# Word cloud for positive sentiment
wordcloud_negative <- reddit_sentiment_clean_positive_unique %>%
  count(word, sort = TRUE)%>%
  head(20)

library(knitr)
kable(as.data.frame(wordcloud_negative[, c("word", "n")]))
word n
athletes 8
runners 8
distance 5
rai 5
select 5
180bpm 4
addition 4
blog 4
comments 4
endurance 4
floatation 4
gray 4
percentage 4
research 4
share 4
steady 4
tinman 4
uptempo 4
146bpm 3
analysis 3

Evaluate credibility of the sentiment analysis

Looking at the sample texts and sentiment scores, it seems that the dictionary method that we employed in the analysis is mostly able to identify not only the sentiments of the individual words, but also negations.

An observation that is worth noting is that when it comes to detailed questions regarding the methodologies of implementing lactate threshold training, some questions are identified as positive and others negative, without a clear boundary for the classification. Detailed questions about training methodologies may involve nuanced language that goes beyond simple positive or negative expressions. For example, a question about lactate threshold training might involve both positive aspects (e.g., effectiveness, benefits) and potential challenges or concerns (e.g., difficulties, limitations). The method’s output may reflect the complexity of sentiments embedded in such questions.

library(knitr)
#sample_texts <- reddit_sentiment %>%
#  sample_n(5)  # Select 5 random rows

# Display the sampled texts and their sentiment scores
#head(sample_texts[, c("title_text", "sentiment_dict")])


sentimentr_example <- reddit_sentiment %>%
  mutate(sentimentr_abs = abs(sentiment_dict),
         sentimentr_binary = case_when(sentiment_dict > 0 ~ 'positive',
                                       TRUE ~ 'negative')) %>% 
  group_by(sentimentr_binary) %>%
  arrange(desc(sentimentr_abs)) %>%
  filter(comments>3)%>%
  slice_head(n = 5) %>%
  ungroup() %>% 
  arrange(sentiment_dict)

# negative
sentimentr_example %>% filter(sentimentr_binary == 'negative') %>% pull(title_text, sentiment_dict) %>% print()
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   -0.375 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  "Lactate Threshold seems inaccurate.. " 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       -0.255221410864953 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                "Why does my lactate threshold improves over time but my lactate threshold Pace worsens? Is it normal?. " 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       -0.198076768585035 
##                                                  "Has anyone gotten a Lactate Threshold update from cycling when using a Garmin Watch?. From reading online, it seems to update Lactate Threahold when using a Garmin Bike computer,  but I don\031t have one. I only have the Forerunner 965 but do not understand why it shouldn\031t also update it? Anyone else experience this?\n\nIt\031s hard to find info on this issue seeing as almost everyone who cycles uses a bike computer. But I don\031t feel the need for one as I only bike indoors." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        -0.19656323333664 
##                                                                                                                                                                                                                                                                            "Matthew Ordish on Twitter: \"James Milner is an absolute machine. Beat everyone in the lactate threshold test today at Melwood. Naby dropped out with one more level to go. Lallana and Milner last two standing. Milner the only man left running laps at 24KMH. Tank.\". " 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       -0.187280155646902 
## "Lactate threshold too difficult. Anyone else find lactate threshold test too difficult. I run everyday for the most part. About 20-30 miles a week. 53 vo2. Max hr 195. I can\031t do the last two rounds on the threshold test and it\031s fucking aggravating. I tried twice. This last one waiting and rested over the weekend. My test goes 136-145 2x, 145-156, 165-175, 175-185(fail here first minute), 185-195. How the fuck are you suppose to run max heart rate after, I\031m gassed. So idk if it\031s me or that\031s just too ridiculous"
# positive
sentimentr_example %>% filter(sentimentr_binary == 'positive') %>% pull(title_text,sentiment_dict) %>% print()
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.346692972266557 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "Everyday runners using lactate threshold blood trackers?. Hey there \024 the title, says it all, but as more and more higher end training tools become accessible and available to every day runners, I wanted to see if anyone had any experience or anecdotal feedback from using these? \n\n\nIs it excessive? Sure! But just wanted to hear about anyone\031s experience with in, positive or negative. Much appreciated." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.347179816544362 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        "Lactate threshold before starting a Garmin coach session. Hi All,\n\nI would like to know if it is advisable to do a Lactate threshold test and set hearth rate zones before I start with a Garmin coach training. \n\nThanks" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.357770876399966 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "Lactate threshold dropping, pace better. " 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.429193292325587 
## "Understanding Lactate Threshold Workouts. I am trying to gain a deeper understanding of the impact, and difference between different LT workouts - especially between cruise intervals and steady state runs.\n\nBoth Daniels and Pfitzinger speak about LT work in their books and the overarching benefit/place for them in a program, but neither really addresses the reasons why you would run different LT style workouts.\n\nTake for example these (randomly made-up to make my math easier) LT workouts:\n\n* 6 x 1m w/ 1min rest\n* 3 x 2m w/ 2min rest\n* 2 x 3m w/ 3min rest\n* 6m at LT\n\nIn all of these examples the LT work is 6 miles, but they are different workouts. My questions (and for discussion) relate mainly to what different training stimulus they provide, and when would you run one instead of another? Are some better earlier in a program, are some better for shorter races (say 5K vs Marathon), are some better at ultimately providing superior LT improvement, should they be used as a ladder as fitness improves moving from shorter internavls to steady state?\n\nJust trying to get a better grasp on how they fit into a program, and any other considerations when utilising LT runs that I may not have previously thought about." 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0.443760156980183 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            "UPDATE Zone Training &amp; Lactate Threshold Test: THANK YOU TO THIS AMAZING COMMUNITY.. "

Discussion of insights

The plots indicate a noticeable upward trend in sentiment surrounding the term ‘lactate threshold’ from 2016 to 2023. Although data points are scarce in 2016, there has been a consistent increase in the number of Reddit threads discussing the ‘Lactate Threshold’ over the years, aligning with our earlier observations. Despite the initial limited data, the positive trend in sentiments is evident.

# create new columns: date, year, day_of_week, is_weekend, time
library(lubridate)
reddit_sentiment <- reddit_sentiment %>% 
  mutate(date = as.POSIXct(date_utc)) %>%
  filter(!is.na(date)) %>% 
  mutate(year = year(date),
         day_of_week = wday(date, label = TRUE),
         is_weekend = ifelse(day_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday"),
         time = timestamp %>% 
           anytime(tz = anytime:::getTZ()) %>% 
           str_split('-| |:') %>% 
           sapply(function(x) as.numeric(x[4])))

save(reddit_sentiment, file = "/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/reddit_sentiment.RData")
load("/Users/helenalindsay/Documents/Fall_23/CP8883/Major4/reddit_sentiment.RData")
library(gridExtra)
reddit_sentiment$date <- as.POSIXct(reddit_sentiment$date, format = "%Y-%m-%d %H:%M:%S")
scatter <- reddit_sentiment %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_jitter(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(reddit_sentiment$date, na.rm = T), 
                                max(reddit_sentiment$date, na.rm = T), 
                                by = "1 year"))

smooth <- reddit_sentiment %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_density_2d_filled(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(reddit_sentiment$date, na.rm = T), 
                                max(reddit_sentiment$date, na.rm = T), 
                                by = "1 year"))

grid.arrange(scatter, smooth, ncol = 1)

Pre February 2022

The plots for before the rise in Reddit threads on the topic ‘Lactate Threshold’ indicate a noticeable upward trend in sentiment. Again, the number of data points are scarce in 2016, but the consistent increase in the positive trend in sentiments is undeniable.

pre_dict <- reddit_sentiment %>%
  filter(date < as.Date("2022-02-01"))%>%
  mutate(date = as.POSIXct(date_utc))

#pre_dict$date <- as.POSIXct(pre_dict$date, format = "%Y-%m-%d %H:%M:%S")

scatter_pre <- pre_dict %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_density_2d_filled(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(pre_dict$date, na.rm = T), 
                                max(pre_dict$date, na.rm = T), 
                                by = "1 year"))

smooth_pre <- pre_dict %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_jitter(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(pre_dict$date, na.rm = T), 
                                max(pre_dict$date, na.rm = T), 
                                by = "1 year"))

grid.arrange(scatter_pre, smooth_pre, ncol = 1)

Post February 2022

The graphical representation of the Reddit threads discussing the ‘Lactate Threshold’ training method reveals an intriguing pattern. Despite a consistent rise in the frequency of occurrences for this keyword on Reddit, observed from February 2022 to approximately August 2023, then a decline through the past few months, there is a noticeable flattening in the trend of sentiment. While the count of discussions on this training method continued to shift, the sentiment associated with these discussions seemed to stabilize at around 0.1 (indicating a positive sentiment).

post_dict <- reddit_sentiment %>%
  filter(date >= as.Date("2022-02-01"))

post_dict$date <- as.POSIXct(post_dict$date, format = "%Y-%m-%d %H:%M:%S")

scatter_post <- post_dict %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_density_2d_filled(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(post_dict$date, na.rm = T), 
                                max(post_dict$date, na.rm = T), 
                                by = "1 year"))

smooth_post <- post_dict %>% 
  ggplot(aes(x = date, y = sentiment_dict)) +
  geom_jitter(width = 0, height = 0.05, color = 'purple') +
  geom_smooth(method = "lm", se = TRUE, color = "gray") +
  scale_x_datetime(date_labels = "%b %y",
                   breaks = seq(min(post_dict$date, na.rm = T), 
                                max(post_dict$date, na.rm = T), 
                                by = "1 year"))

grid.arrange(scatter_post, smooth_post, ncol = 1)

Extra Plots

# sentiment by year
reddit_sentiment %>% 
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = 'stack') +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), 
                                  max(reddit_sentiment$year), 
                                  by = 1)) +
  scale_fill_brewer(palette = 'PuRd', direction = -1)

# sentiment by year
reddit_sentiment %>% 
  ggplot(aes(x = year, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_x_continuous(breaks = seq(min(reddit_sentiment$year), 
                                  max(reddit_sentiment$year), 
                                  by = 1)) +
  scale_fill_brewer(palette = 'PuRd', direction = -1) 

Sentiment by day of week.

# sentiment by day
reddit_sentiment %>% 
  ggplot(aes(x = day_of_week, fill = bert_label)) +
  geom_bar(position = 'fill') +
  scale_fill_brewer(palette = 'PuRd', direction = -1) 

Sentiment by time of day.

reddit_sentiment %>% 
  ggplot(aes(x = time, fill = bert_label)) +
  geom_histogram(bins = 24, position = 'fill', color = 'black', lwd = 0.2) +
  scale_x_continuous(breaks = seq(0, 24, by=1)) + 
  scale_fill_manual(values = c('#bc5090', '#bc5090', '#ff6361', '#ffa600', '#ffa600')) 

Conclusion

Since the publication of the SweatElite article, and the rise of Norwegian long-distance runners in the athletics scene, the term ‘Lactate Threshold’ has gained popularity and is now utilized by athletes globally. This project examined Reddit, an inclusive platform accessible to individuals worldwide, to assess whether the observed trend, a phenomenon I’ve encountered as a Division I runner, can be identified through text and sentiment analysis.

In short, it was a success. Below are a few notable takeaways.

  1. In February 2022, there was a notable surge in the usage of the term ‘Lactate Threshold’ on the Reddit platform. This surge occurred approximately four years after the first public article discussing the training insights of Norwegian athletes, indicating a lag in the dissemination of information and the subsequent adoption of the term within the Reddit community.

  2. The trigrams for pre and post February 2022 illustrated the rise in complexity of the words surrounding the phrase ‘lactate threshold’ throughout the years. The only strong relations amongst words before February 2022 were more simple training concepts such as ‘max heart rate’, ‘traditional tempo’, and the keyword itself, ‘lactate threshold’. One can see that the variety of trigrams were limited when compared to the full trigram that spans the 2016-2023 period. On the contrary, after February 2022, the variety of trigrams increased, encompassing phrases that illustrate more sophisticated training methodologies such as ‘guided lactate threshold’ and ‘blood lactate testing’.

  3. In addition to the clear associations among the terms ‘lactate,’ ‘threshold,’ and ‘test,’ collectively defining the widely adopted training method known as the ‘lactate threshold test,’ trigrams featuring the term ‘heart’ prominently surfaced. The term ‘pace’ emerged as a crucial link, bridging the concept of ‘lactate threshold’ with phrases centered around the term ‘heart.’ This observation implies an intricate interplay between physiological thresholds and metrics related to the heart in the context of training methodologies.

  4. Despite a consistent rise in the frequency of occurrences for the keyword on Reddit, observed from February 2022 to approximately August 2023, then a decline through the past few months, there was a noticeable flattening in the trend of sentiment. While the count of discussions on this training method continued to shift, the sentiment associated with these discussions seemed to stabilize at around 0.1 (indicating a positive sentiment). This could be interpreted as evidence that, following years of research and firsthand experience with the training methods, the Reddit community has collectively endorsed the effectiveness of Lactate Threshold training.

The findings could be interpreted as evidence that, after years of research and firsthand experience with the training methods spanning from 2016 to 2023, the Reddit community has collectively embraced and acknowledged the effectiveness of Lactate Threshold training. Continuously tracking trends and sentiments offers valuable insights into the evolution of opinions and practices over time. This analysis highlights Reddit’s potential as a valuable resource for observing trends and sentiments in the realm of athletics.