Catresa Barlow & Wan-Ting Tsai
E-cigarettes entered the US market in 2007 (Preventing Tobacco Addiction Foundation, 2019), and use among youth and young adults has increased steadily since the product’s introduction (Office of the Surgeon General, 2016). Sweet flavors like candy apple, bubble gum, marshmallow, cherry cola, s’mores, chocolate, orange soda, and taffy entice young people to try these products, and the online availability of flavored tobacco products makes them easily accessible to minors (Preventing Tobacco Addiction Foundation, 2019).
Tobacco use among youth and young adults represents a major public health concern in the United States. “More than 95% of addicted smokers start before age 21,” and nicotine alters receptors in the adolescent brain, creating the potential for lifelong addiction (Preventing Tobacco Addiction Foundation, 2019). Although the use of conventional tobacco products by youth and young adults has declined in recent decades, the Centers for Disease Control and Prevention reports an increase in the use of “emerging tobacco products” such as e-cigarettes among this population (Office of the Surgeon General, 2016).
Tobacco companies employ social media to market their products to youth and young adults. Cessation and prevention campaigns therefore require an understanding of how this population communicates about and uses these products. A key challenge for surveillance of the products and their patterns of use is the diverse, nonstandard nomenclature for the devices (Alexander et al., 2016): both the companies and consumers refer to them as “e-cigarettes,” “e-cigs,” “cigalikes,” “e-hookahs,” “mods,” “vape pens,” “vapes,” and “tank systems.”
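As an illustration only (not part of our collection pipeline), the sketch below shows one way this scattered vocabulary could be matched in free text with a single stringr pattern; the device_pattern object and the sample comments vector are hypothetical.
library(stringr)
#hypothetical pattern covering the common device-name variants
device_pattern <- regex("e[- ]?cig(arette)?s?|cigalikes?|e[- ]?hookahs?|\\bmods?\\b|vape( pens?)?s?|vaping|tank systems?", ignore_case = TRUE)
#made-up sample text for the example
comments <- c("Just got a new vape pen", "Tried an e-hookah once", "I prefer box mods")
str_detect(comments, device_pattern)
## [1] TRUE TRUE TRUE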
We hope to gain insight into tobacco marketing and patterns of use by analyzing social media platforms. Our project focuses on smoking among youth and young adults, and we believe Reddit is a platform widely used by this group. We wish to learn how tobacco companies communicate with this population and how its members communicate with one another about e-cigarettes.
Our team used RedditExtractoR to collect our data. We extracted posts and comments from subreddits using the following keywords: “vape”, “e-cigarettes”, “vaping”, “juul”, and “e-cig”. Our dataset contains roughly 365,000 rows and 18 columns.
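A minimal sketch of that collection step is shown below. It assumes the get_reddit() helper from the RedditExtractoR version available at the time (which accepts search_terms and page_threshold arguments); the page_threshold value is illustrative rather than the exact setting from our run.
library(RedditExtractoR)
library(dplyr)
library(readr)
#pull posts and comments matching each keyword, then stack and deduplicate
keywords <- c("vape", "e-cigarettes", "vaping", "juul", "e-cig")
reddit_raw <- keywords %>%
lapply(function(kw) get_reddit(search_terms = kw, page_threshold = 5)) %>%
bind_rows() %>%
distinct()
#save the combined pull for the analysis below
write_csv(reddit_raw, "reddit_data.csv")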
reddit_comments <- read_csv("reddit_data.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## structure = col_character(),
## post_date = col_character(),
## comm_date = col_character(),
## num_comments = col_double(),
## subreddit = col_character(),
## upvote_prop = col_double(),
## post_score = col_double(),
## author = col_character(),
## user = col_character(),
## comment_score = col_double(),
## controversiality = col_double(),
## comment = col_character(),
## title = col_character(),
## post_text = col_character(),
## link = col_character(),
## domain = col_character(),
## URL = col_character()
## )
#remove duplicate rows
reddit_comments <- unique(reddit_comments)
kable(reddit_comments[1000, ], caption = "Raw Reddit Data") %>%
kable_styling(font_size = 10) %>%
scroll_box(width = "100%", height = "200px")
| id | structure | post_date | comm_date | num_comments | subreddit | upvote_prop | post_score | author | user | comment_score | controversiality | comment | title | post_text | link | domain | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 87 | 42_1 | 15-12-15 | 15-12-15 | 662 | electronic_cigarette | 0.88 | 79 | northwest_juice | northwest_juice | 1 | 0 | what is this?? Sounds dangerous and tasty | 990ml [giveaway]! + Buy One Get One Free | TLDR: Buy One Get One Free, nwjuice.com CODE STILL ACTIVE THROUGH THE END OF TODAY DEC 16 Winners announced: **u/Sychophantom u/PeggsVaper u/tehdweeb u/Something_Berserker u/mccoog40 u/MMMWWW50 u/vapaiolo u/minetoo05 u/switzy u/TigerCR1200 u/RoadSurfer Please send us a message with your email so we can send you an invoice for your free product. There’s no need to send your address as you will need to fill this out on the invoice we send you. Congratulations to all of the winners and thank you for participating! Have a merry Christmas, Happy Hanukkah, Kwanzaa, etc. -Ian** Happy Holidays! Only 9 more days until that day of the year many of us look forward to! In celebration of the holidays we are hosting another giveaway. To enter, all you need to do is list your favorite aspect(s) of the holiday season. For me (Ian) personally, I enjoy the time I get to spend with family and watching my little brother open his gifts from Santa (not Krampus, please no). We will draw, then announce the winners of our giveaway at noon tomorrow (Dec. 16th). At least 11 winners will be selected to win 90ml of joooos from us (up to 3 variations). In addition to this giveaway, we will be doubling all orders for free when the code REDDITDOUBLE is entered or mentioned at checkout LINK (this will end tomorrow at midnight). If you have a preference on the free portion of your order, please leave a note, otherwise we will simply double the quantity of the juice you place an order for. We want to be able to guarantee delivery before xmas, so we will update this post if we receive a large volume of orders. If your order comes in after we post an update, we will send you an email notification as a courtesy. Always free us-shipping on orders over $10 NEW THINGS: We are having issues with our Hot Apple Cider flavor (in our opinion, no customers have mentioned any issues), so we have pulled it from the site while we attempt to fix it. Our last promotion gave us a backlog of orders which we have finally been able to deliver to all customers. If you had any issues, please contact us. Note: The coupon code does nothing other than let us know to double up your order, our checkout system isnt fancy enough to automatically apply a buy one get one free deal. Thank you for taking the time to read this. Good luck! and VAPE ON | https://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/ | self.electronic_cigarette | http://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/?ref=search_posts |
kable(summary(reddit_comments), caption = "Data Summary") %>%
kable_styling(font_size = 10) %>%
scroll_box(width ="100%", height = "200px")
| id | structure | post_date | comm_date | num_comments | subreddit | upvote_prop | post_score | author | user | comment_score | controversiality | comment | title | post_text | link | domain | URL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Length:364467 | Length:364467 | Length:364467 | Min. : 83.0 | Length:364467 | Min. :0.2900 | Min. : 0 | Length:364467 | Length:364467 | Min. : -195.00 | Min. :0.00000 | Length:364467 | Length:364467 | Length:364467 | Length:364467 | Length:364467 | Length:364467 | |
| 1st Qu.: 78.0 | Class :character | Class :character | Class :character | 1st Qu.: 332.0 | Class :character | 1st Qu.:0.8400 | 1st Qu.: 70 | Class :character | Class :character | 1st Qu.: 1.00 | 1st Qu.:0.00000 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | |
| Median :172.0 | Mode :character | Mode :character | Mode :character | Median : 672.0 | Mode :character | Median :0.8800 | Median : 128 | Mode :character | Mode :character | Median : 1.00 | Median :0.00000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | |
| Mean :196.7 | NA | NA | NA | Mean : 845.5 | NA | Mean :0.8733 | Mean : 2910 | NA | NA | Mean : 11.63 | Mean :0.01709 | NA | NA | NA | NA | NA | NA | |
| 3rd Qu.:309.0 | NA | NA | NA | 3rd Qu.: 940.0 | NA | 3rd Qu.:0.9200 | 3rd Qu.: 593 | NA | NA | 3rd Qu.: 2.00 | 3rd Qu.:0.00000 | NA | NA | NA | NA | NA | NA | |
| Max. :500.0 | NA | NA | NA | Max. :12751.0 | NA | Max. :1.0000 | Max. :97636 | NA | NA | Max. :17588.00 | Max. :1.00000 | NA | NA | NA | NA | NA | NA |
#restructure data frame for comments
reddit_comment_df <- reddit_comments %>%
mutate(thread_id = URL) %>%
mutate(comm_date = dmy(comm_date)) %>%
mutate(comment = str_replace_all(comment, "<.+>", " ")) %>%
mutate(comment = str_replace_all(comment, "\\W", " ")) %>%
mutate(comment = str_replace_all(comment,
"www|https|http", " ")) %>%
mutate(comment_id = structure,
comment_score = as.numeric(comment_score)) %>%
select(comm_date, subreddit, thread_id, user,
comment_id, comment) %>%
unique() %>%
na.omit()
# restructure dataframe for title and post text
reddit_post_df <- reddit_comments %>%
mutate(comm_date = dmy(post_date)) %>%
mutate(thread_id = URL) %>%
mutate(user = author) %>%
mutate(comment_id = 0) %>%
mutate(post_text = replace_na(post_text, "."),
title = replace_na(title, ".")) %>%
mutate(post_text = str_replace_all(post_text,
"<.+>|www|https|http", " "),
title = str_replace_all(title,
"<.+>|www|https|http", " ")) %>%
mutate(text = str_c(title, post_text, sep = ",")) %>%
mutate(comment = text) %>%
mutate(comment = str_replace_all(comment, "\\W", " ")) %>%
select(comm_date, subreddit, thread_id, user,
comment_id, comment) %>%
unique() %>%
na.omit()
#combine comments and post text
reddit_df <- rbind(reddit_post_df, reddit_comment_df)
#display dataset
kable(reddit_df[1, ], caption = "Selected Reddit Data") %>%
kable_styling(font_size = 10) %>%
scroll_box(width = "100%", height = "200px")
| comm_date | subreddit | thread_id | user | comment_id | comment |
|---|---|---|---|---|---|
| 2016-07-04 | electronic_cigarette | http://www.reddit.com/r/electronic_cigarette/comments/4r8dhk/blue_dot_vapors_4th_of_july_giveaway_1_day_1_big/?ref=search_posts | BlueDotVapors | 0 | Blue Dot Vapors 4th of July giveaway 1 day 1 big winner Intro Hello ECR Happy 4th of July everyone We hope everyone stays safe and has fun Giveaway We are giving away 1776ml to one lucky winner To enter please copy and pasta d d You must be of legal smoking age to participate Winners will be selected at random Giveaway ends at 11 59pm PDT 8GMT 7 4 16 International entrees welcome Void where prohibited One entry per contestant Age will be verified before prizes are distributed Congratulations you won 1776ml of eliquid u KGKiddyDiddler Sale Until July 4th at midnight Pacific time we will have 20 off all orders and 50 off all presteep flavors with code MericasBirth We discounted all of the presteeps from 7 99 to 5 After the 20 discount this brings them to 4 and 50 off Please be aware that during sale times order processing can take up to 5 business days We have refined our processes and are faster than ever so we hope to have orders out quickly but this one looks like it will be pretty big so we are expecting somewhat of a backup News We have release 6 new flavors Black Cow Black Cow is an old timey Root Beer float with Vanilla Ice Cream A delicious and refreshing vape especially during hot weather Dark Matter We have solved one of the greatest scientific mysteries of the Universe We have captured Dark Matter and put it in vape form Our scientists have ascertained that it is composed of Peppermint Chocolate amp Cream Gentlemen Juice Gentleman and ladies will find this a welcome change from your typical watermelon menthol flavors Gentleman juice is a sophisticated Candied Watermelon with rich concord grape juice and a touch of menthol Refreshing and delicious Mountain Top Mountain Top started as a simple strawberry watermelon vape but after we added a touch of crisp Fuji Apple and a dash of menthol we knew this one had fully taken shape With a very slight cooling effect Mountain Top is a refreshing summer standout that is sure to satisfy Samoan Clouds Samoan Clouds is a toasted coconut caramel and chocolate cookie that is very reminiscent of the seasonal cookies Titan Titan is the finished product of months of work and over a dozen renditions This is a sweet apricot nectarine torte finished with a tiny touch of candied ginger A definitely unique flavor profile that will satisfy desert lovers and fruit lovers alike Unfortunately we are forced to stop shipping to Utah and Indiana at this time If you would like to see our products in a vape shop near you we welcome referrals Please forward our information to your local shops or send us an email at info bluedotvapors com Thank you to you all for your support Vape On Sincerely Team Blue Dot bluedotvapors com |
reddit_df %>%
group_by(month = floor_date(comm_date, "month")) %>%
summarize(comments = n()) %>%
ggplot(aes(month, comments)) +
geom_line() +
labs(title = "Comments Timeline") +
scale_x_date(date_labels = "%b/%y", date_breaks = "4 months") +
theme(axis.text.x = element_text(size = 7, angle = 45))
Explore Subreddit Threads
#number of comments
n_comments <- nrow(reddit_df)
#number of observations by subreddits
subreddits <- reddit_df %>%
count(subreddit) %>%
unique()
#number of subreddits
n_subreddits <- nrow(subreddits)
#number of threads
threads <- reddit_df %>%
count(thread_id) %>%
unique()
summary_df <- tibble("number_subreddits" = n_subreddits,
"number_threads" = nrow(threads),
"number_comments" = n_comments)
summary_df %>%
kable() %>%
kable_styling(c("striped", "bordered"),
full_width = FALSE, position = "left") %>%
row_spec(row = 1, align = "right")
| number_subreddits | number_threads | number_comments |
|---|---|---|
| 171 | 1087 | 328828 |
#subreddit with more than 1000 comments
reddit_df %>%
group_by(subreddit) %>%
summarize(comments = n()) %>%
filter(comments > 1000) %>%
ggplot(aes(subreddit, comments, fill = comments)) +
geom_col() +
coord_flip() +
labs(title = "Subreddits With Highest Number of Comments (>1000 comments)")
#display top 5 threads
top5 <- top_n(threads, 5) %>%
arrange(desc(n))
## Selecting by n
top5 %>%
kable(caption = "Top Five Threads by Comment Count") %>%
kable_styling(font_size = 8)
Explore Users
#identify frequent users (more than 500 comments)
user_df <- reddit_comments %>%
group_by(user) %>%
summarise(n = n()) %>%
filter(n<500)
freq_users <- reddit_comments %>%
anti_join(user_df, by = "user") %>%
mutate(user = str_replace_all(user, "\\Q[deleted]\\E", " ")) %>%
group_by(user) %>%
summarise(n = n())
freq_users %>%
filter(user != " ") %>%
ggplot((aes(x = n, y = user, color = user))) +
geom_point() +
labs(x = "number of comments", y = "user", title = "Users With Highest Number of Comments (>500 comments)")
#xlab("number of comments") +
#ylab("freq_user")
User Network
#user network for smallest thread
target_df <- reddit_comments[which(reddit_comments$num_comments ==
min(reddit_comments$num_comments)), ]
network_list <- target_df %>%
user_network(include_author=FALSE, agg=TRUE)
network_list$plot
#tokenize
tokens <- reddit_df %>%
unnest_tokens(output = word, input = comment)
token_count <- tokens %>%
count(word,
sort = TRUE)
cleaned_tokens <- tokens %>%
anti_join(get_stopwords())
## Joining, by = "word"
#find numbers
nums <- cleaned_tokens %>%
filter(str_detect(word, "^[0-9]")) %>%
select(word) %>% unique()
#remove numbers
cleaned_tokens <- cleaned_tokens %>%
anti_join(nums, by = "word")
#length(unique(cleaned_tokens$word))
rare <- cleaned_tokens %>%
count(word) %>%
filter(n<10) %>%
select(word) %>%
unique()
cleaned_tokens <- cleaned_tokens %>%
anti_join(rare, by = "word")
letters <- cleaned_tokens %>%
filter(str_length(word) < 3) %>%
select(word) %>%
unique()
cleaned_tokens <- cleaned_tokens %>%
anti_join(letters, by = "word")
word_count <- cleaned_tokens %>%
count(word,
sort = TRUE)
word_by_subreddit <- cleaned_tokens %>%
count(subreddit, word, sort= TRUE) %>%
ungroup()
kable(head(word_count), caption = "word count") %>%
kable_styling(full_width=FALSE, position= "left")
| word | n |
|---|---|
| thanks | 42390 |
| like | 41007 |
| just | 37763 |
| can | 33102 |
| one | 25645 |
| people | 25335 |
kable(head(word_by_subreddit), caption = "word by subreddit") %>%
kable_styling()
| subreddit | word | n |
|---|---|---|
| electronic_cigarette | thanks | 40074 |
| electronic_cigarette | like | 21526 |
| electronic_cigarette | just | 17831 |
| electronic_cigarette | juice | 17313 |
| electronic_cigarette | giveaway | 16787 |
| electronic_cigarette | can | 15198 |
#histogram
cleaned_tokens %>%
count(word, sort = T) %>%
rename(word_freq = n) %>%
ggplot(aes(x=word_freq)) +
geom_histogram(aes(y=..count..),
color="black",
fill="blue",
alpha=0.3) +
scale_x_continuous(breaks=c(0:5,10,100,500,10e3),
trans="log1p",
expand=c(0,0)) +
scale_y_continuous(breaks=c(0,100,1000,5e3,10e3,4e4,5e4,10e4),
expand=c(0,0)) +
theme_bw() +
labs(title = "Word Frequency Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Our team used text mining techniques to surveil young e-cigarette users who post content on Reddit. The goal is to understand how youth and young adults discuss reasons for use, harm perception, frequency of use, flavorings, advertising exposure, and quitting experiences.
We employed sentiment analysis to characterize the positive and negative words used to describe e-cigarette products. We also compared word frequencies across subreddits to discover which words occur most often in discussions and to identify emerging products and trends.
For the sentiment analysis, we used the bing and afinn lexicons and examined sentiment for words overall as well as by subreddit.
comment_sentiment = cleaned_tokens %>%
left_join(get_sentiments("bing")) %>%
rename(bing = sentiment) %>%
left_join(get_sentiments("afinn")) %>%
rename(afinn = score)
## Joining, by = "word"
## Joining, by = "word"
#Sentiment Analysis - bing by word
bing_word_counts <- comment_sentiment %>%
filter(!is.na(bing)) %>%
count(word, bing, sort = TRUE)
bing_word_counts %>%
filter(n > 5000) %>%
mutate(n = ifelse(bing == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = bing)) +
geom_col() +
coord_flip() +
labs(y = "Contribution to sentiment - bing")
#Sentiment Analysis by subreddit - bing
subreddit_sent_bing <- word_by_subreddit %>%
inner_join(get_sentiments("bing"), by = "word") %>%
mutate(sentiment = ifelse(sentiment == "negative", -1, 1)) %>%
group_by(subreddit) %>%
summarize(sentiment = sum(sentiment * n) / sum(n))
subreddit_sent_bing %>%
top_n(30, abs(sentiment)) %>%
mutate(subreddit = reorder(subreddit, sentiment)) %>%
ggplot(aes(subreddit, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("Average bing sentiment score - bing")
#--------------------------------------------------------------
#Sentiment Analysis - afinn by word
afinn_word <- comment_sentiment %>%
filter(!is.na(afinn)) %>%
select(word, afinn) %>%
group_by(word) %>%
summarize(occurences = n(),
contribution = sum(afinn))
afinn_word %>%
top_n(25, abs(contribution)) %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(word, contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(y = "Contribution to sentiment - afinn")
#Sentiment Analysis - afinn by subreddit
subreddit_sentiment <- word_by_subreddit %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(subreddit) %>%
summarize(score = sum(score * n) / sum(n))
subreddit_sentiment %>%
top_n(30, abs(score)) %>%
mutate(subreddit = reorder(subreddit, score)) %>%
ggplot(aes(subreddit, score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("Average sentiment score - afinn")
# word correlation
uncommon <- cleaned_tokens %>%
count(word) %>%
filter(n<1000) %>%
select(word) %>%
unique()
word_cor = cleaned_tokens %>%
anti_join(uncommon, by = "word") %>%
widyr::pairwise_cor(word, thread_id) %>%
filter(!is.na(correlation),
correlation > .50)
# Visualizing the correlations
word_cor_plot <- word_cor %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
word_cor_plot
#Tokenizing by bi-gram
cleaned_text <- reddit_df %>%
#mutate(comment = str_replace_all(comment, "[0-9]", " ")) %>%
mutate(comment = str_replace_all(comment, "^[a-z]{0,2}$", " ")) %>%
select(subreddit, comment)
bigrams <- cleaned_text %>%
unnest_tokens(bigram, comment,
token = "ngrams", n = 2)
# bigram counts:
bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 1,593,954 x 2
## bigram n
## <chr> <int>
## 1 it s 33438
## 2 i m 31030
## 3 for the 28107
## 4 don t 25075
## 5 thanks for 21698
## 6 of the 18977
## 7 in the 18898
## 8 i ve 13857
## 9 if you 12879
## 10 and i 12711
## # … with 1,593,944 more rows
# filtering n-grams
bigrams_separated <- bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word &
str_length(word1)>1 &
word1 != "removed") %>%
filter(!word2 %in% stop_words$word &
str_length(word2)>1 &
word2 != "removed")
new_bigrams <-bigrams_filtered %>%
count(word1, word2, sort = TRUE)
# new bigram counts - first 5 rows:
kable(head(new_bigrams), caption = "Bi-gram") %>%
kable_styling(full_width=FALSE, position= "left")
| word1 | word2 | n |
|---|---|---|
| quit | smoking | 2714 |
| message | compose | 1382 |
| awesome | giveaway | 1356 |
| max | vg | 1298 |
| amp | nbsp | 1213 |
| box | mod | 1207 |
#-----------------------------------------------
#Tokenizing by tri-gram
trigrams <- reddit_df %>%
mutate(comment = str_replace_na(comment)) %>%
mutate(comment = str_replace_all(comment, "[0-9]|NA", " ")) %>%
unnest_tokens(trigram, comment,
token = "ngrams", n = 3)
#trigrams %>% select(trigram)
# trigram counts:
#trigrams %>%
# count(trigram, sort = TRUE)
# filtering n-grams
trigrams_separated <- trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
trigrams_filtered <- trigrams_separated %>%
filter(!word1 %in% stop_words$word & is.na(word1) == FALSE) %>%
filter(!word2 %in% stop_words$word & is.na(word2) == FALSE) %>%
filter(!word3 %in% stop_words$word & is.na(word3) == FALSE)
new_trigram <- trigrams_filtered %>%
count(word1, word2, word3, sort = TRUE)
# new trigram counts:
kable(head(new_trigram), caption = "Tri-gram") %>%
kable_styling()
| word1 | word2 | word3 | n |
|---|---|---|---|
| subreddit | message | compose | 1131 |
| reviews | official | site | 876 |
| love | donuts | mg | 647 |
| mg | max | vg | 624 |
| wikipedia | org | wiki | 596 |
| evic | vtc | mini | 559 |
# define a nice color palette
pal <- brewer.pal(8,"Dark2")
# plot the 150 most common words
trigrams_filtered %>%
count(word3) %>%
with(wordcloud(word3, n, scale=c(4,.1), min.freq=50, max.words = 150,
random.order = FALSE, rot.per=.15, colors=pal))
trigrams_filtered %>%
count(word2) %>%
with(wordcloud(word2, n, scale=c(4,.1), min.freq=50, max.words=150,
random.order=FALSE, rot.per=.15,
colors=pal))
trigrams_filtered %>%
count(word1) %>%
with(wordcloud(word1, n, scale=c(4,.1), min.freq=50, max.words = 150,
random.order = FALSE, rot.per=.15, colors=pal))
Word Frequency
#Document(thread) Term Matrix
#Post Term Matrix
word_counts_by_thread_id <- cleaned_tokens %>%
group_by(thread_id) %>%
count(word, sort = TRUE)
review_dtm <- word_counts_by_thread_id %>%
cast_dtm(thread_id, word, n)
review_dtm
## <<DocumentTermMatrix (documents: 1087, terms: 17682)>>
## Non-/sparse entries: 1255964/17964370
## Sparsity : 93%
## Maximal term length: 49
## Weighting : term frequency (tf)
#tf-idf
tfidf <- word_counts_by_thread_id %>%
bind_tf_idf(word, thread_id, n)
top_tfidf = tfidf %>%
group_by(thread_id) %>%
arrange(desc(tf_idf))
kable(head(top_tfidf), caption = "Term Frequency by Thread") %>%
kable_styling(font_size = 9)
Word Frequency by Subreddit
tf_idf <- word_by_subreddit %>%
bind_tf_idf(word, subreddit, n) %>%
arrange(desc(tf_idf))
tf_idf %>%
filter(str_detect(subreddit, "^vap|^Vap|^ele|^sci|^teen|^ecig|^Inn")) %>%
group_by(subreddit) %>%
top_n(12, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot((aes(word, tf_idf, fill = subreddit))) +
geom_col(show.legend = FALSE) +
facet_wrap( ~ subreddit, scale = "free") +
ylab("tf-idf") +
coord_flip() +
labs(title = "Term Frequency By Subreddit")
subreddit_cors <- word_by_subreddit %>%
pairwise_cor(subreddit, word, n, sort = TRUE)
head(subreddit_cors)
## # A tibble: 6 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 worldnews news 0.972
## 2 news worldnews 0.972
## 3 technology worldnews 0.948
## 4 worldnews technology 0.948
## 5 unitedkingdom worldnews 0.948
## 6 worldnews unitedkingdom 0.948
set.seed(1231)
subreddit_cors %>%
filter(correlation > .90) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 6, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
lda5 <- LDA(review_dtm, k = 5, control = list(seed = 1234))
topics <- terms(lda5, 100)
kable(head(topics)) %>%
kable_styling()
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|
| juice | like | nicotine | just | thanks |
| com | just | people | like | giveaway |
| thanks | one | can | can | love |
| mini | can | just | get | good |
| name | get | smoking | vaping | awesome |
| smok | battery | like | one | thank |
Sentiment Analysis
We analyzed sentiment by word and by subreddit and found that the majority of sentiment-bearing words in electronic cigarette posts and comments were positive. Users frequently used words such as “good”, “awesome”, “happy”, and “love” in discussions, and most subreddits had an overall positive sentiment.
Relationships Between Words
Bigram - We noted strong relationships between word pairs such as “quit smoking,” “awesome giveaway,” and “started vaping.” More analysis is needed to understand the reasons and motivations of those who wish to quit. We would also like to understand the role of giveaways in encouraging young people to start smoking or vaping.
Trigram - The trigram analysis did not reveal any significant trend related to smoking or e-cigarettes. Additional analysis is required.
Word and Document Frequency - Our analysis revealed that the words used most frequently in discussions included “juice”, “vaping”, and “giveaway”. The prominence of “juice” suggests that young e-cigarette users are attracted to flavored products, and additional analysis is needed to understand how these products are discussed. The frequency of “giveaway” suggests that free products are often offered as an enticement for users to begin vaping.
Topic Modeling - The results from the topic models are consistent with the other analyses: all topics contain similar e-cigarette-related words.