Team Members

Catresa Barlow & Wan-Ting Tsai

Business Context

E-cigarettes entered the US market in 2007 (Preventing Tobacco Addiction Foundation, 2019), and use among youth and young adults has increased steadily since the product’s introduction (Office of the Surgeon General, 2016). Sweet flavors like candy apple, bubble gum, marshmallow, cherry cola, s’mores, chocolate, orange soda, and taffy entice young people to try these products. The online availability of flavored tobacco products makes them easily accessible to minors (Preventing Tobacco Addiction Foundation, 2019).

Tobacco use among youth and young adults represents a major public health concern in the United States. “More than 95% of addicted smokers start before age 21,” and nicotine changes receptors in the adolescent brain, creating lifelong addiction (Preventing Tobacco Addiction Foundation, 2019). Although the use of conventional tobacco products by youth and young adults has declined in recent decades, the Centers for Disease Control and Prevention reports an increase in the use of “emerging tobacco products” like e-cigarettes among this population (Office of the Surgeon General, 2016).

Problem Description

Tobacco companies employ social media to market their products to youth and young adults. Cessation and prevention campaigns require an understanding of how this population communicates about and uses these products. A key challenge for surveillance of the products and understanding their patterns of use is the diverse and nonstandard nomenclature for the devices (Alexander et al., 2016). Both the companies themselves and consumers refer to these devices as “e-cigarettes,” “e-cigs,” “cigalikes,” “e-hookahs,” “mods,” “vape pens,” “vapes,” and “tank systems.”
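
To make the synonym problem concrete, any keyword-based surveillance has to cover these variants with a single pattern. A hypothetical matcher (illustrative only; this is not the pattern used for our collection):

library(stringr)

# hypothetical pattern covering common device-name variants
device_pattern <- regex("e-?cig(arette)?s?|cigalikes?|e-?hookahs?|\\bmods?\\b|vape(\\s?pens?)?|tank\\s?systems?",
                        ignore_case = TRUE)

str_detect(c("new vape pen day", "cigalike starter kit", "my mod died"),
           device_pattern)
## [1] TRUE TRUE TRUE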

We hope to gain insight into tobacco marketing and patterns of use by analyzing social media platforms. Our project focuses on smoking among youth and young adults, a group we believe is active on Reddit. We wish to learn how tobacco companies communicate with this population and how its members communicate with one another about e-cigarettes.

Data Summary, Exploration, and Discussion

Our team used RedditExtractoR to collect our data. We extracted posts and comments from subreddits using the following keywords: “vape”, “e-cigarettes”, “vaping”, “juul”, and “e-cig”. Our dataset contains roughly 365,000 rows (364,467 after removing duplicates) and 18 columns.
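
The collection step itself is not shown in this report; a minimal sketch, assuming the 2.x generation of RedditExtractoR (get_reddit() and its arguments changed in later releases), would look roughly like this:

# illustrative sketch: crawl threads matching each keyword and pool the results
# (get_reddit() is the RedditExtractoR 2.x entry point; adjust for newer versions)
library(RedditExtractoR)
library(purrr)
library(readr)

keywords <- c("vape", "e-cigarettes", "vaping", "juul", "e-cig")

reddit_data <- map_dfr(keywords,
                       ~ get_reddit(search_terms = .x,
                                    page_threshold = 5,
                                    wait_time = 2))

write_csv(unique(reddit_data), "reddit_data.csv")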

Load Data and Pre-process Data
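
The chunks below assume a setup chunk (not shown) that loads the packages used throughout the analysis; roughly:

library(tidyverse)       # readr, dplyr, tidyr, stringr, purrr, ggplot2
library(lubridate)       # dmy(), floor_date()
library(knitr)           # kable()
library(kableExtra)      # kable_styling(), scroll_box(), row_spec()
library(tidytext)        # unnest_tokens(), get_stopwords(), bind_tf_idf(), cast_dtm()
library(widyr)           # pairwise_cor()
library(igraph)          # graph_from_data_frame()
library(ggraph)          # network plots
library(topicmodels)     # LDA()
library(wordcloud)       # wordcloud()
library(RColorBrewer)    # brewer.pal()
library(RedditExtractoR) # user_network()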

reddit_comments <- read_csv("reddit_data.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   structure = col_character(),
##   post_date = col_character(),
##   comm_date = col_character(),
##   num_comments = col_double(),
##   subreddit = col_character(),
##   upvote_prop = col_double(),
##   post_score = col_double(),
##   author = col_character(),
##   user = col_character(),
##   comment_score = col_double(),
##   controversiality = col_double(),
##   comment = col_character(),
##   title = col_character(),
##   post_text = col_character(),
##   link = col_character(),
##   domain = col_character(),
##   URL = col_character()
## )
#check for unique rows
reddit_comments <- unique(reddit_comments)
kable(reddit_comments[1000, ], caption = "Raw Reddit Data") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Raw Reddit Data (rendered as a scrollable table; one row shown, long text truncated)
id structure post_date comm_date num_comments subreddit upvote_prop post_score author user comment_score controversiality comment title post_text link domain URL
87 42_1 15-12-15 15-12-15 662 electronic_cigarette 0.88 79 northwest_juice northwest_juice 1 0 what is this?? Sounds dangerous and tasty 990ml [giveaway]! + Buy One Get One Free [vendor giveaway post: winner announcements, a buy-one-get-one coupon code, free US shipping on orders over $10, and new-flavor news; full text truncated] https://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/ self.electronic_cigarette http://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/?ref=search_posts
kable(summary(reddit_comments), caption = "Data Summary") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Data Summary (numeric columns shown; each of the 12 character columns reports Length:364467)

statistic  id      num_comments  upvote_prop  post_score  comment_score  controversiality
Min.       1.0     83.0          0.2900       0           -195.00        0.00000
1st Qu.    78.0    332.0         0.8400       70          1.00           0.00000
Median     172.0   672.0         0.8800       128         1.00           0.00000
Mean       196.7   845.5         0.8733       2910        11.63          0.01709
3rd Qu.    309.0   940.0         0.9200       593         2.00           0.00000
Max.       500.0   12751.0       1.0000       97636       17588.00       1.00000
#restructure data frame for comments
reddit_comment_df <- reddit_comments %>%
  mutate(thread_id = URL) %>%
  mutate(comm_date = dmy(comm_date)) %>%
  mutate(comment = str_replace_all(comment, "<[^>]+>", " ")) %>%  # strip HTML tags
  mutate(comment = str_replace_all(comment, "\\W", " ")) %>%      # drop punctuation
  mutate(comment = str_replace_all(comment,
                                   "www|https|http", " ")) %>%    # drop URL fragments
  mutate(comment_id = structure) %>%
  select(comm_date, subreddit, thread_id, user,
         comment_id, comment) %>%
  unique() %>%
  na.omit()

# restructure data frame for post title and text
reddit_post_df <- reddit_comments %>%
  mutate(comm_date = dmy(post_date)) %>%
  mutate(thread_id = URL) %>%
  mutate(user = author) %>%
  mutate(comment_id = 0) %>%
  mutate(post_text = replace_na(post_text, "."),
         title = replace_na(title, ".")) %>%
  mutate(post_text = str_replace_all(post_text,
                                     "<[^>]+>|www|https|http", " "),
         title = str_replace_all(title,
                                 "<[^>]+>|www|https|http", " ")) %>%
  mutate(comment = str_c(title, post_text, sep = ",")) %>%
  mutate(comment = str_replace_all(comment, "\\W", " ")) %>%
  select(comm_date, subreddit, thread_id, user,
         comment_id, comment) %>%
  unique() %>%
  na.omit()

#combine comments and post text
reddit_df <- rbind(reddit_post_df, reddit_comment_df)

#display dataset
kable(reddit_df[1, ], caption = "Selected Reddit Data") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Selected Reddit Data (one row shown; cleaned post text truncated)
comm_date subreddit thread_id user comment_id comment
2016-07-04 electronic_cigarette http://www.reddit.com/r/electronic_cigarette/comments/4r8dhk/blue_dot_vapors_4th_of_july_giveaway_1_day_1_big/?ref=search_posts BlueDotVapors 0 Blue Dot Vapors 4th of July giveaway 1 day 1 big winner Intro Hello ECR Happy 4th of July everyone We hope everyone stays safe and has fun Giveaway We are giving away 1776ml to one lucky winner … [remainder of vendor giveaway and flavor-announcement text truncated]

Explore Data

reddit_df %>%
  group_by(month = floor_date(comm_date, "month")) %>%
  summarize(comments = n()) %>%
  ggplot(aes(month, comments)) +
  geom_line() +
  labs(title = "Comments Timeline") +
  scale_x_date(date_labels = "%b/%y", date_breaks = "4 months") + 
  theme(axis.text.x = element_text(size = 7, angle = 45))

Explore Subreddit Threads

#number of comments
n_comments <- nrow(reddit_df)

#number of observations by subreddits
subreddits <- reddit_df %>%
  count(subreddit) %>%
  unique()

#number of subreddits
n_subreddits <- nrow(subreddits)

#number of threads
threads <- reddit_df %>%
  count(thread_id) %>%
  unique()

summary_df <- tibble("number_subreddits" = n_subreddits,
                     "number_threads" = nrow(threads),
                     "number_comments" = n_comments)

summary_df %>%
  kable() %>%
  kable_styling(c("striped", "bordered"), 
                full_width = FALSE, position = "left") %>%
  row_spec(row = 1, align = "right")
number_subreddits number_threads number_comments
171 1087 328828
#subreddit with more than 1000 comments
reddit_df %>%
  group_by(subreddit) %>%
  summarize(comments = n()) %>%
  filter(comments > 1000) %>%
  ggplot(aes(subreddit, comments, fill = comments)) +
  geom_col() +
  coord_flip() +
  labs(title = "Subreddits With Highest Number of Comments (>1000 comments)")

#display top 5 threads by comment count
top5 <- top_n(threads, 5, n) %>%
  arrange(desc(n))

top5 %>%
  kable(caption = "Top Five Threads by Comment Count") %>%
  kable_styling(font_size = 8)
Top Five Threads by Comment Count
thread_id n
http://www.reddit.com/r/electronic_cigarette/comments/25vvzu/249ml_giveaway_to_celebrate_my_first_year/?ref=search_posts 913
http://www.reddit.com/r/Coilporn/comments/28oc1i/contest_the_glowing_ohm_giveaway/?ref=search_posts 904
http://www.reddit.com/r/electronic_cigarette/comments/27bboo/round_1_and_2_winners_here_and_now_round_3_300ml/?ref=search_posts 501
http://www.reddit.com/r/electronic_cigarette/comments/2j1v09/thanks_reddit_3x_30ml_bottle_giveaway/?ref=search_posts 501
http://www.reddit.com/r/electronic_cigarette/comments/4658d5/vapemailclub_giveaway_what_are_you_vaping_on_25_x/?ref=search_posts 501

Explore Users

#users with fewer than 500 comments (to be excluded)
infrequent_users <- reddit_comments %>%
  group_by(user) %>%
  summarise(n = n()) %>%
  filter(n < 500)

#keep only high-volume users; blank out deleted accounts
freq_users <- reddit_comments %>%
  anti_join(infrequent_users, by = "user") %>%
  mutate(user = str_replace_all(user, "\\Q[deleted]\\E", " ")) %>%
  group_by(user) %>%
  summarise(n = n())

freq_users %>%
  filter(user != " ") %>%
  ggplot(aes(x = n, y = user, color = user)) +
  geom_point() +
  labs(x = "number of comments", y = "user",
       title = "Users With Highest Number of Comments (>500 comments)")

User Network

#user network for the thread with the fewest comments
target_df <- reddit_comments %>%
  filter(num_comments == min(num_comments))
network_list <- target_df %>%
  user_network(include_author=FALSE, agg=TRUE) 

network_list$plot

Tokenize Data

#tokenize
tokens <- reddit_df %>%
  unnest_tokens(output = word, input = comment)

token_count <- tokens %>%
  count(word,
        sort = TRUE)

cleaned_tokens <- tokens %>% 
  anti_join(get_stopwords())
## Joining, by = "word"
#find numbers
nums <- cleaned_tokens %>% 
  filter(str_detect(word, "^[0-9]")) %>% 
  select(word) %>% unique()

#remove numbers
cleaned_tokens <- cleaned_tokens %>% 
  anti_join(nums, by = "word")

#length(unique(cleaned_tokens$word))

rare <- cleaned_tokens %>% 
  count(word) %>% 
  filter(n<10) %>% 
  select(word) %>% 
  unique()

cleaned_tokens <- cleaned_tokens %>% 
  anti_join(rare, by = "word")

#words shorter than 3 characters (renamed to avoid shadowing base R's `letters`)
short_words <- cleaned_tokens %>%
  filter(str_length(word) < 3) %>%
  select(word) %>%
  unique()

cleaned_tokens <- cleaned_tokens %>%
  anti_join(short_words, by = "word")

word_count <- cleaned_tokens %>%
  count(word,
        sort = TRUE)

word_by_subreddit <- cleaned_tokens %>%
  count(subreddit, word, sort= TRUE) %>%
  ungroup()

kable(head(word_count), caption = "word count") %>%
  kable_styling(full_width=FALSE, position= "left")
word count
word n
thanks 42390
like 41007
just 37763
can 33102
one 25645
people 25335
kable(head(word_by_subreddit), caption = "word by subreddit") %>%
  kable_styling()
word by subreddit
subreddit word n
electronic_cigarette thanks 40074
electronic_cigarette like 21526
electronic_cigarette just 17831
electronic_cigarette juice 17313
electronic_cigarette giveaway 16787
electronic_cigarette can 15198

Visualize Word Count

#histogram
cleaned_tokens %>%
  count(word, sort = T) %>%
  rename(word_freq = n) %>%
  ggplot(aes(x=word_freq)) +
  geom_histogram(aes(y=..count..), 
                 color="black", 
                 fill="blue", 
                 alpha=0.3) + 
  scale_x_continuous(breaks=c(0:5,10,100,500,10e3), 
                     trans="log1p", 
                     expand=c(0,0)) + 
  scale_y_continuous(breaks=c(0,100,1000,5e3,10e3,4e4,5e4,10e4), 
                     expand=c(0,0)) + 
  theme_bw() +
  labs(title = "Word Frequency Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Natural Language Processing (NLP) Procedure Summary

Our team used text mining techniques to surveil young e-cigarette users who post content on Reddit. The goal is to understand how youth and young adults discuss reasons for use, harm perception, frequency of use, flavorings, ad exposure, and quitting experiences.

We employed sentiment analysis to characterize the positive and negative words used to describe e-cigarette products. We also compared word frequencies across subreddits to discover which words occur most often in discussions and to identify emerging products and trends.

Sentiment Analysis

For our sentiment analysis, we used the Bing and AFINN lexicons. We examined sentiment for words overall as well as by subreddit.

comment_sentiment <- cleaned_tokens %>%
  left_join(get_sentiments("bing")) %>%
  rename(bing = sentiment) %>%
  left_join(get_sentiments("afinn")) %>%
  rename(afinn = score)
## Joining, by = "word"
## Joining, by = "word"
#Sentiment Analysis - bing by word
bing_word_counts <- comment_sentiment %>% 
  filter(!is.na(bing)) %>% 
  count(word, bing, sort = TRUE)

bing_word_counts %>%
  filter(n > 5000) %>%
  mutate(n = ifelse(bing == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = bing)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment - bing")

#Sentiment Analysis by subreddit - bing
subreddit_sent_bing <- word_by_subreddit %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(sentiment = ifelse(sentiment == "negative", -1, 1)) %>% 
  group_by(subreddit) %>%
  summarize(sentiment = sum(sentiment * n) / sum(n))

subreddit_sent_bing %>%
  top_n(30, abs(sentiment)) %>%
  mutate(subreddit = reorder(subreddit, sentiment)) %>%
  ggplot(aes(subreddit, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average bing sentiment score - bing")

#--------------------------------------------------------------
#Sentiment Analysis - afinn by word
afinn_word <- comment_sentiment %>%
  filter(!is.na(afinn)) %>%
  select(word, afinn) %>%
  group_by(word) %>%
  summarize(occurences = n(),
            contribution = sum(afinn))


afinn_word %>%
  top_n(25, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(y = "Contribution to sentiment - afinn")

#Sentiment Analysis - afinn by subreddit
subreddit_sentiment <- word_by_subreddit %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(subreddit) %>%
  summarize(score = sum(score * n) / sum(n))

subreddit_sentiment %>%
  top_n(30, abs(score)) %>%
  mutate(subreddit = reorder(subreddit, score)) %>%
  ggplot(aes(subreddit, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score - afinn")

Relationships Between Words

# word correlation
uncommon <- cleaned_tokens %>% 
  count(word) %>%
  filter(n<1000) %>%
  select(word) %>% 
  unique()

word_cor <- cleaned_tokens %>%
  anti_join(uncommon, by = "word") %>% 
  widyr::pairwise_cor(word, thread_id) %>% 
  filter(!is.na(correlation),
         correlation > .50)

# Visualizing the correlations                                             
word_cor_plot <- word_cor %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 5) + 
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()  

word_cor_plot

#Tokenizing by bi-gram
cleaned_text <- reddit_df %>% 
  #mutate(comment = str_replace_all(comment,  "[0-9]", " "))  %>%
  mutate(comment = str_replace_all(comment, "^[a-z]{0,2}$", " ")) %>%
  select(subreddit, comment)

bigrams <- cleaned_text %>% 
  unnest_tokens(bigram, comment,
                token = "ngrams", n = 2) 
    
# bigram counts:
bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 1,593,954 x 2
##    bigram         n
##    <chr>      <int>
##  1 it s       33438
##  2 i m        31030
##  3 for the    28107
##  4 don t      25075
##  5 thanks for 21698
##  6 of the     18977
##  7 in the     18898
##  8 i ve       13857
##  9 if you     12879
## 10 and i      12711
## # … with 1,593,944 more rows
# filtering n-grams
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word & 
           str_length(word1)>1 &
           word1 != "removed") %>% 
  filter(!word2 %in% stop_words$word &
           str_length(word2)>1 &
           word2 != "removed")

new_bigrams <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
  
# new bigram counts - first 5 rows:
kable(head(new_bigrams), caption = "Bi-gram") %>%
  kable_styling(full_width=FALSE, position= "left")
Bi-gram
word1 word2 n
quit smoking 2714
message compose 1382
awesome giveaway 1356
max vg 1298
amp nbsp 1213
box mod 1207
#-----------------------------------------------
#Tokenizing by tri-gram
trigrams <- reddit_df %>% 
  mutate(comment = str_replace_na(comment)) %>%
  mutate(comment = str_replace_all(comment,  "[0-9]|NA", " "))  %>%
  unnest_tokens(trigram, comment,
                token = "ngrams", n = 3) 
#trigrams %>% select(trigram)

# trigram counts:
#trigrams %>%
#  count(trigram, sort = TRUE)

# filtering n-grams
trigrams_separated <- trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word & !is.na(word1)) %>%
  filter(!word2 %in% stop_words$word & !is.na(word2)) %>%
  filter(!word3 %in% stop_words$word & !is.na(word3))
  

new_trigram <- trigrams_filtered %>%
  count(word1, word2, word3, sort = TRUE)

# new trigram counts:
kable(head(new_trigram), caption = "Tri-gram") %>%
  kable_styling()
Tri-gram
word1 word2 word3 n
subreddit message compose 1131
reviews official site 876
love donuts mg 647
mg max vg 624
wikipedia org wiki 596
evic vtc mini 559
# define a nice color palette
pal <- brewer.pal(8,"Dark2")

# word clouds: up to 150 most common words in each trigram position (word3, word2, word1)
trigrams_filtered %>%
  count(word3) %>%
  with(wordcloud(word3, n, scale=c(4,.1), min.freq=50, max.words = 150,
                 random.order = FALSE, rot.per=.15, colors=pal))

trigrams_filtered %>%
  count(word2) %>%
  with(wordcloud(word2, n, scale=c(4,.1), min.freq=50, max.words=150,
                random.order=FALSE, rot.per=.15,
                colors=pal))

trigrams_filtered %>%
  count(word1) %>%
  with(wordcloud(word1, n, scale=c(4,.1), min.freq=50, max.words = 150,
                 random.order = FALSE, rot.per=.15, colors=pal))

Word and Document Frequency

Word Frequency

#Document(thread) Term Matrix
#Post Term Matrix

word_counts_by_thread_id <- cleaned_tokens %>% 
  group_by(thread_id) %>%
  count(word, sort = TRUE)

review_dtm <- word_counts_by_thread_id %>% 
  cast_dtm(thread_id, word, n)

review_dtm
## <<DocumentTermMatrix (documents: 1087, terms: 17682)>>
## Non-/sparse entries: 1255964/17964370
## Sparsity           : 93%
## Maximal term length: 49
## Weighting          : term frequency (tf)
#tf-idf
tfidf <- word_counts_by_thread_id %>% 
  bind_tf_idf(word, thread_id, n) 

top_tfidf <- tfidf %>%
  arrange(desc(tf_idf))

kable(head(top_tfidf), caption = "Term Frequency by Thread") %>%
  kable_styling(font_size = 9)
Term Frequency by Thread
thread_id word n tf idf tf_idf
http://www.reddit.com/r/electronic_cigarette/comments/41qenv/we_have_officially_been_open_for_an_entire_year/?ref=search_posts vapeaversary 418 0.3866790 6.298030 2.4353158
http://www.reddit.com/r/electronic_cigarette/comments/4wc97f/blue_dot_vapors_giveaway_sale_news_release_of_6/?ref=search_posts nfda 459 0.2372093 6.298030 1.4939512
http://www.reddit.com/r/electronic_cigarette/comments/29i7n7/vaporhop_50_bottle_750_ml_itaste_kit_giveaway_37/?ref=search_posts vaporhop 321 0.1244668 6.298030 0.7838959
http://www.reddit.com/r/electronic_cigarette/comments/3ek576/new_bot_pretends_its_you_based_on_comment_history/?ref=search_posts user_simulator 385 0.1221059 6.298030 0.7690268
http://www.reddit.com/r/electronic_cigarette/comments/3jmxt5/labor_day_sale_and_giveaway_35_off_everything_in/?ref=search_posts sicilian 239 0.1151252 6.298030 0.7250622
http://www.reddit.com/r/vapeitforward/comments/1wiaq3/reddit_raffle_nemmy_clone_extras_giveaway/?ref=search_posts nemmy 487 0.1513833 4.100805 0.6207933

Word Frequency by Subreddit

tf_idf <- word_by_subreddit %>%
  bind_tf_idf(word, subreddit, n) %>%
  arrange(desc(tf_idf))

tf_idf %>%
  filter(str_detect(subreddit, "^vap|^Vap|^ele|^sci|^teen|^ecig|^Inn")) %>%
  group_by(subreddit) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  facet_wrap( ~ subreddit, scale = "free") +
  ylab("tf-idf") +
  coord_flip() +
  labs(title = "Term Frequency By Subreddit")

subreddit_cors <- word_by_subreddit %>%
  pairwise_cor(subreddit, word, n, sort = TRUE)

head(subreddit_cors)
## # A tibble: 6 x 3
##   item1         item2         correlation
##   <chr>         <chr>               <dbl>
## 1 worldnews     news                0.972
## 2 news          worldnews           0.972
## 3 technology    worldnews           0.948
## 4 worldnews     technology          0.948
## 5 unitedkingdom worldnews           0.948
## 6 worldnews     unitedkingdom       0.948
set.seed(1231)

subreddit_cors %>%
  filter(correlation > .90) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 6, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Topic Modeling

lda5 <- LDA(review_dtm, k = 5, control = list(seed = 1234)) 

topics <- terms(lda5, 100)
kable(head(topics)) %>%
  kable_styling()
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
juice like nicotine just thanks
com just people like giveaway
thanks one can can love
mini can just get good
name get smoking vaping awesome
smok battery like one thank
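
The table above lists only the top terms. To see how strongly each term is weighted within a topic, the fitted model can be tidied into per-topic word probabilities; a sketch, assuming a tidytext version that provides tidy() for LDA objects along with reorder_within() and scale_x_reordered():

# per-topic word probabilities (beta) for the fitted 5-topic model
lda5_beta <- tidy(lda5, matrix = "beta")

lda5_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "Top Terms per Topic (beta)")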

NLP result summary and discussion

Sentiment Analysis
We analyzed sentiment by word and by subreddit and found that the majority of sentiment-bearing words in electronic cigarette posts and comments were positive. Users favored words like “good”, “awesome”, “happy”, and “love”, and most subreddits also had an overall positive sentiment.

Relationships Between Words
Bigram - We noted strong relationships between word pairs such as “quit smoking”, “awesome giveaway”, and “started vaping”. More analysis is needed to understand the reasons and motivations of those who wish to quit. We would also like to understand the role of giveaways in encouraging young smokers to start smoking or vaping.

Trigram - The trigram analysis did not reveal any significant trend related to smoking or e-cigarettes. Additional analysis is required.

Word and Document Frequency - Our analysis revealed that the words used most frequently in discussions were “juice”, “vaping”, and “giveaway”. Young e-cigarette users are attracted to flavored products, or “juice”. Additional analysis is needed to understand how these products are discussed. The frequency of the word “giveaway” also suggests that free products are often offered as an enticement for users to begin smoking.

Topic Modeling - The results from the topic models are consistent with the other analyses. All topics contain similar e-cigarette-related words.