Team Members

Catresa Barlow & Wan-Ting Tsai

Business Context

E-cigarettes entered the US market in 2007 (Preventing Tobacco Addiction Foundation, 2019), and use among youth and young adults has increased steadily since the product’s introduction (Office of the Surgeon General, 2016). Sweet flavors like candy apple, bubble gum, marshmallow, cherry cola, s’mores, chocolate, orange soda, and taffy entice young people to try these products. The online availability of flavored tobacco products makes them easily accessible to minors (Preventing Tobacco Addiction Foundation, 2019).

Tobacco use among youth and young adults represents a major public health concern in the United States. “More than 95% of addicted smokers start before age 21,” and nicotine changes receptors in the adolescent brain, creating lifelong addiction (Preventing Tobacco Addiction Foundation, 2019). Although the use of conventional tobacco products by youth and young adults has declined in recent decades, the Centers for Disease Control and Prevention reports an increase in the use of “emerging tobacco products” like e-cigarettes among this population (Office of the Surgeon General, 2016).

Problem Description

Tobacco companies employ social media to market their products to youth and young adults. Cessation and prevention campaigns require an understanding of how this population communicates about and uses these products. A key challenge for surveillance of the products and understanding their patterns of use is the diverse and nonstandard nomenclature for the devices (Alexander et al., 2016). Both the companies themselves and consumers refer to these devices as “e-cigarettes,” “e-cigs,” “cigalikes,” “e-hookahs,” “mods,” “vape pens,” “vapes,” and “tank systems.”
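
To make the synonym problem concrete, any keyword-based surveillance has to cover these variants with a single pattern. A hypothetical matcher (illustrative only; this is not the pattern used for our collection):

library(stringr)

# hypothetical pattern covering common device-name variants
device_pattern <- regex("e-?cig(arette)?s?|cigalikes?|e-?hookahs?|\\bmods?\\b|vape(\\s?pens?)?|tank\\s?systems?",
                        ignore_case = TRUE)

str_detect(c("new vape pen day", "cigalike starter kit", "my mod died"),
           device_pattern)
## [1] TRUE TRUE TRUE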

We hope to gain insight into tobacco marketing and patterns of use by analyzing social media platforms. Our project focuses on smoking among youth and young adults, a group we believe is active on Reddit. We wish to learn how tobacco companies communicate with this population and how its members communicate with one another about e-cigarettes.

Data Summary, Exploration, and Discussion

Our team used RedditExtractoR to collect our data. We extracted posts and comments from subreddits using the following keywords: “vape”, “e-cigarettes”, “vaping”, “juul”, and “e-cig”. Our dataset contains roughly 365,000 rows (364,467 after removing duplicates) and 18 columns.
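
The collection step itself is not shown in this report; a minimal sketch, assuming the 2.x generation of RedditExtractoR (get_reddit() and its arguments changed in later releases), would look roughly like this:

# illustrative sketch: crawl threads matching each keyword and pool the results
# (get_reddit() is the RedditExtractoR 2.x entry point; adjust for newer versions)
library(RedditExtractoR)
library(purrr)
library(readr)

keywords <- c("vape", "e-cigarettes", "vaping", "juul", "e-cig")

reddit_data <- map_dfr(keywords,
                       ~ get_reddit(search_terms = .x,
                                    page_threshold = 5,
                                    wait_time = 2))

write_csv(unique(reddit_data), "reddit_data.csv")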

Load Data and Pre-process Data
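
The chunks below assume a setup chunk (not shown) that loads the packages used throughout the analysis; roughly:

library(tidyverse)       # readr, dplyr, tidyr, stringr, purrr, ggplot2
library(lubridate)       # dmy(), floor_date()
library(knitr)           # kable()
library(kableExtra)      # kable_styling(), scroll_box(), row_spec()
library(tidytext)        # unnest_tokens(), get_stopwords(), bind_tf_idf(), cast_dtm()
library(widyr)           # pairwise_cor()
library(igraph)          # graph_from_data_frame()
library(ggraph)          # network plots
library(topicmodels)     # LDA()
library(wordcloud)       # wordcloud()
library(RColorBrewer)    # brewer.pal()
library(RedditExtractoR) # user_network()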

reddit_comments <- read_csv("reddit_data.csv")
## Parsed with column specification:
## cols(
##   id = col_double(),
##   structure = col_character(),
##   post_date = col_character(),
##   comm_date = col_character(),
##   num_comments = col_double(),
##   subreddit = col_character(),
##   upvote_prop = col_double(),
##   post_score = col_double(),
##   author = col_character(),
##   user = col_character(),
##   comment_score = col_double(),
##   controversiality = col_double(),
##   comment = col_character(),
##   title = col_character(),
##   post_text = col_character(),
##   link = col_character(),
##   domain = col_character(),
##   URL = col_character()
## )
#check for unique rows
reddit_comments <- unique(reddit_comments)
kable(reddit_comments[1000, ], caption = "Raw Reddit Data") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Raw Reddit Data (rendered as a scrollable table; one row shown, long text truncated)
id structure post_date comm_date num_comments subreddit upvote_prop post_score author user comment_score controversiality comment title post_text link domain URL
87 42_1 15-12-15 15-12-15 662 electronic_cigarette 0.88 79 northwest_juice northwest_juice 1 0 what is this?? Sounds dangerous and tasty 990ml [giveaway]! + Buy One Get One Free [vendor giveaway post: winner announcements, a buy-one-get-one coupon code, free US shipping on orders over $10, and new-flavor news; full text truncated] https://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/ self.electronic_cigarette http://www.reddit.com/r/electronic_cigarette/comments/3wyvrr/990ml_giveaway_buy_one_get_one_free/?ref=search_posts
kable(summary(reddit_comments), caption = "Data Summary") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Data Summary (numeric columns shown; each of the 12 character columns reports Length:364467)

statistic  id      num_comments  upvote_prop  post_score  comment_score  controversiality
Min.       1.0     83.0          0.2900       0           -195.00        0.00000
1st Qu.    78.0    332.0         0.8400       70          1.00           0.00000
Median     172.0   672.0         0.8800       128         1.00           0.00000
Mean       196.7   845.5         0.8733       2910        11.63          0.01709
3rd Qu.    309.0   940.0         0.9200       593         2.00           0.00000
Max.       500.0   12751.0       1.0000       97636       17588.00       1.00000
#restructure data frame for comments
reddit_comment_df <- reddit_comments %>%
  mutate(thread_id = URL) %>%
  mutate(comm_date = dmy(comm_date)) %>%
  mutate(comment = str_replace_all(comment, "<[^>]+>", " ")) %>%  # strip HTML tags
  mutate(comment = str_replace_all(comment, "\\W", " ")) %>%      # drop punctuation
  mutate(comment = str_replace_all(comment,
                                   "www|https|http", " ")) %>%    # drop URL fragments
  mutate(comment_id = structure) %>%
  select(comm_date, subreddit, thread_id, user,
         comment_id, comment) %>%
  unique() %>%
  na.omit()

# restructure data frame for post title and text
reddit_post_df <- reddit_comments %>%
  mutate(comm_date = dmy(post_date)) %>%
  mutate(thread_id = URL) %>%
  mutate(user = author) %>%
  mutate(comment_id = 0) %>%
  mutate(post_text = replace_na(post_text, "."),
         title = replace_na(title, ".")) %>%
  mutate(post_text = str_replace_all(post_text,
                                     "<[^>]+>|www|https|http", " "),
         title = str_replace_all(title,
                                 "<[^>]+>|www|https|http", " ")) %>%
  mutate(comment = str_c(title, post_text, sep = ",")) %>%
  mutate(comment = str_replace_all(comment, "\\W", " ")) %>%
  select(comm_date, subreddit, thread_id, user,
         comment_id, comment) %>%
  unique() %>%
  na.omit()

#combine comments and post text
reddit_df <- rbind(reddit_post_df, reddit_comment_df)

#display dataset
kable(reddit_df[1, ], caption = "Selected Reddit Data") %>%
  kable_styling(font_size = 10) %>%
  scroll_box(width = "100%", height = "200px")
Selected Reddit Data (one row shown; cleaned post text truncated)
comm_date subreddit thread_id user comment_id comment
2016-07-04 electronic_cigarette http://www.reddit.com/r/electronic_cigarette/comments/4r8dhk/blue_dot_vapors_4th_of_july_giveaway_1_day_1_big/?ref=search_posts BlueDotVapors 0 Blue Dot Vapors 4th of July giveaway 1 day 1 big winner Intro Hello ECR Happy 4th of July everyone We hope everyone stays safe and has fun Giveaway We are giving away 1776ml to one lucky winner … [remainder of vendor giveaway and flavor-announcement text truncated]

Explore Data

reddit_df %>%
  group_by(month = floor_date(comm_date, "month")) %>%
  summarize(comments = n()) %>%
  ggplot(aes(month, comments)) +
  geom_line() +
  labs(title = "Comments Timeline") +
  scale_x_date(date_labels = "%b/%y", date_breaks = "4 months") + 
  theme(axis.text.x = element_text(size = 7, angle = 45))

Explore Subreddit Threads

#number of comments
n_comments <- nrow(reddit_df)

#number of observations by subreddits
subreddits <- reddit_df %>%
  count(subreddit) %>%
  unique()

#number of subreddits
n_subreddits <- nrow(subreddits)

#number of threads
threads <- reddit_df %>%
  count(thread_id) %>%
  unique()

summary_df <- tibble("number_subreddits" = n_subreddits,
                     "number_threads" = nrow(threads),
                     "number_comments" = n_comments)

summary_df %>%
  kable() %>%
  kable_styling(c("striped", "bordered"), 
                full_width = FALSE, position = "left") %>%
  row_spec(row = 1, align = "right")
number_subreddits number_threads number_comments
171 1087 328828
#subreddit with more than 1000 comments
reddit_df %>%
  group_by(subreddit) %>%
  summarize(comments = n()) %>%
  filter(comments > 1000) %>%
  ggplot(aes(subreddit, comments, fill = comments)) +
  geom_col() +
  coord_flip() +
  labs(title = "Subreddits With Highest Number of Comments (>1000 comments)")

#display top 5 threads by comment count
top5 <- top_n(threads, 5, n) %>%
  arrange(desc(n))

top5 %>%
  kable(caption = "Top Five Threads by Comment Count") %>%
  kable_styling(font_size = 8)
Top Five Threads by Comment Count
thread_id n
http://www.reddit.com/r/electronic_cigarette/comments/25vvzu/249ml_giveaway_to_celebrate_my_first_year/?ref=search_posts 913
http://www.reddit.com/r/Coilporn/comments/28oc1i/contest_the_glowing_ohm_giveaway/?ref=search_posts 904
http://www.reddit.com/r/electronic_cigarette/comments/27bboo/round_1_and_2_winners_here_and_now_round_3_300ml/?ref=search_posts 501
http://www.reddit.com/r/electronic_cigarette/comments/2j1v09/thanks_reddit_3x_30ml_bottle_giveaway/?ref=search_posts 501
http://www.reddit.com/r/electronic_cigarette/comments/4658d5/vapemailclub_giveaway_what_are_you_vaping_on_25_x/?ref=search_posts 501

Explore Users

#users with fewer than 500 comments (to be excluded)
infrequent_users <- reddit_comments %>%
  group_by(user) %>%
  summarise(n = n()) %>%
  filter(n < 500)

#keep only high-volume users; blank out deleted accounts
freq_users <- reddit_comments %>%
  anti_join(infrequent_users, by = "user") %>%
  mutate(user = str_replace_all(user, "\\Q[deleted]\\E", " ")) %>%
  group_by(user) %>%
  summarise(n = n())

freq_users %>%
  filter(user != " ") %>%
  ggplot(aes(x = n, y = user, color = user)) +
  geom_point() +
  labs(x = "number of comments", y = "user",
       title = "Users With Highest Number of Comments (>500 comments)")

User Network

#user network for the thread with the fewest comments
target_df <- reddit_comments %>%
  filter(num_comments == min(num_comments))
network_list <- target_df %>%
  user_network(include_author=FALSE, agg=TRUE) 

network_list$plot

Tokenize Data

#tokenize
tokens <- reddit_df %>%
  unnest_tokens(output = word, input = comment)

token_count <- tokens %>%
  count(word,
        sort = TRUE)

cleaned_tokens <- tokens %>% 
  anti_join(get_stopwords())
## Joining, by = "word"
#find numbers
nums <- cleaned_tokens %>% 
  filter(str_detect(word, "^[0-9]")) %>% 
  select(word) %>% unique()

#remove numbers
cleaned_tokens <- cleaned_tokens %>% 
  anti_join(nums, by = "word")

#length(unique(cleaned_tokens$word))

rare <- cleaned_tokens %>% 
  count(word) %>% 
  filter(n<10) %>% 
  select(word) %>% 
  unique()

cleaned_tokens <- cleaned_tokens %>% 
  anti_join(rare, by = "word")

#words shorter than 3 characters (renamed to avoid shadowing base R's `letters`)
short_words <- cleaned_tokens %>%
  filter(str_length(word) < 3) %>%
  select(word) %>%
  unique()

cleaned_tokens <- cleaned_tokens %>%
  anti_join(short_words, by = "word")

word_count <- cleaned_tokens %>%
  count(word,
        sort = TRUE)

word_by_subreddit <- cleaned_tokens %>%
  count(subreddit, word, sort= TRUE) %>%
  ungroup()

kable(head(word_count), caption = "word count") %>%
  kable_styling(full_width=FALSE, position= "left")
word count
word n
thanks 42390
like 41007
just 37763
can 33102
one 25645
people 25335
kable(head(word_by_subreddit), caption = "word by subreddit") %>%
  kable_styling()
word by subreddit
subreddit word n
electronic_cigarette thanks 40074
electronic_cigarette like 21526
electronic_cigarette just 17831
electronic_cigarette juice 17313
electronic_cigarette giveaway 16787
electronic_cigarette can 15198

Visualize Word Count

#histogram
cleaned_tokens %>%
  count(word, sort = T) %>%
  rename(word_freq = n) %>%
  ggplot(aes(x=word_freq)) +
  geom_histogram(aes(y=..count..), 
                 color="black", 
                 fill="blue", 
                 alpha=0.3) + 
  scale_x_continuous(breaks=c(0:5,10,100,500,10e3), 
                     trans="log1p", 
                     expand=c(0,0)) + 
  scale_y_continuous(breaks=c(0,100,1000,5e3,10e3,4e4,5e4,10e4), 
                     expand=c(0,0)) + 
  theme_bw() +
  labs(title = "Word Frequency Histogram")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Natural Language Processing (NLP) Procedure Summary

Our team used text mining techniques to surveil young e-cigarette users who post content on Reddit. The goal is to understand how youth and young adults discuss reasons for use, harm perception, frequency of use, flavorings, ad exposure, and quitting experiences.

We employed sentiment analysis to characterize the positive and negative words used to describe e-cigarette products. We also compared word frequencies across subreddits to discover which words occur most often in discussions and to identify emerging products and trends.

Sentiment Analysis

For our sentiment analysis, we used the Bing and AFINN lexicons. We examined sentiment for words overall as well as by subreddit.

comment_sentiment <- cleaned_tokens %>%
  left_join(get_sentiments("bing")) %>%
  rename(bing = sentiment) %>%
  left_join(get_sentiments("afinn")) %>%
  rename(afinn = score)
## Joining, by = "word"
## Joining, by = "word"
#Sentiment Analysis - bing by word
bing_word_counts <- comment_sentiment %>% 
  filter(!is.na(bing)) %>% 
  count(word, bing, sort = TRUE)

bing_word_counts %>%
  filter(n > 5000) %>%
  mutate(n = ifelse(bing == "negative", -n, n)) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = bing)) +
  geom_col() +
  coord_flip() +
  labs(y = "Contribution to sentiment - bing")

#Sentiment Analysis by subreddit - bing
subreddit_sent_bing <- word_by_subreddit %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(sentiment = ifelse(sentiment == "negative", -1, 1)) %>% 
  group_by(subreddit) %>%
  summarize(sentiment = sum(sentiment * n) / sum(n))

subreddit_sent_bing %>%
  top_n(30, abs(sentiment)) %>%
  mutate(subreddit = reorder(subreddit, sentiment)) %>%
  ggplot(aes(subreddit, sentiment, fill = sentiment > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average bing sentiment score - bing")

#--------------------------------------------------------------
#Sentiment Analysis - afinn by word
afinn_word <- comment_sentiment %>%
  filter(!is.na(afinn)) %>%
  select(word, afinn) %>%
  group_by(word) %>%
  summarize(occurences = n(),
            contribution = sum(afinn))


afinn_word %>%
  top_n(25, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(word, contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(y = "Contribution to sentiment - afinn")

#Sentiment Analysis - afinn by subreddit
subreddit_sentiment <- word_by_subreddit %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(subreddit) %>%
  summarize(score = sum(score * n) / sum(n))

subreddit_sentiment %>%
  top_n(30, abs(score)) %>%
  mutate(subreddit = reorder(subreddit, score)) %>%
  ggplot(aes(subreddit, score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score - afinn")

Relationships Between Words

# word correlation
uncommon <- cleaned_tokens %>% 
  count(word) %>%
  filter(n<1000) %>%
  select(word) %>% 
  unique()

word_cor <- cleaned_tokens %>%
  anti_join(uncommon, by = "word") %>% 
  widyr::pairwise_cor(word, thread_id) %>% 
  filter(!is.na(correlation),
         correlation > .50)

# Visualizing the correlations                                             
word_cor_plot <- word_cor %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
  geom_node_point(color = "lightblue", size = 5) + 
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()  

word_cor_plot

#Tokenizing by bi-gram
cleaned_text <- reddit_df %>% 
  #mutate(comment = str_replace_all(comment,  "[0-9]", " "))  %>%
  mutate(comment = str_replace_all(comment, "^[a-z]{0,2}$", " ")) %>%
  select(subreddit, comment)

bigrams <- cleaned_text %>% 
  unnest_tokens(bigram, comment,
                token = "ngrams", n = 2) 
    
# bigram counts:
bigrams %>%
  count(bigram, sort = TRUE)
## # A tibble: 1,593,954 x 2
##    bigram         n
##    <chr>      <int>
##  1 it s       33438
##  2 i m        31030
##  3 for the    28107
##  4 don t      25075
##  5 thanks for 21698
##  6 of the     18977
##  7 in the     18898
##  8 i ve       13857
##  9 if you     12879
## 10 and i      12711
## # … with 1,593,944 more rows
# filtering n-grams
bigrams_separated <- bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word & 
           str_length(word1)>1 &
           word1 != "removed") %>% 
  filter(!word2 %in% stop_words$word &
           str_length(word2)>1 &
           word2 != "removed")

new_bigrams <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
  
# new bigram counts - first 5 rows:
kable(head(new_bigrams), caption = "Bi-gram") %>%
  kable_styling(full_width=FALSE, position= "left")
Bi-gram
word1 word2 n
quit smoking 2714
message compose 1382
awesome giveaway 1356
max vg 1298
amp nbsp 1213
box mod 1207
#-----------------------------------------------
#Tokenizing by tri-gram
trigrams <- reddit_df %>% 
  mutate(comment = str_replace_na(comment)) %>%
  mutate(comment = str_replace_all(comment,  "[0-9]|NA", " "))  %>%
  unnest_tokens(trigram, comment,
                token = "ngrams", n = 3) 
#trigrams %>% select(trigram)

# trigram counts:
#trigrams %>%
#  count(trigram, sort = TRUE)

# filtering n-grams
trigrams_separated <- trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word & !is.na(word1)) %>%
  filter(!word2 %in% stop_words$word & !is.na(word2)) %>%
  filter(!word3 %in% stop_words$word & !is.na(word3))
  

new_trigram <- trigrams_filtered %>%
  count(word1, word2, word3, sort = TRUE)

# new trigram counts:
kable(head(new_trigram), caption = "Tri-gram") %>%
  kable_styling()
Tri-gram
word1 word2 word3 n
subreddit message compose 1131
reviews official site 876
love donuts mg 647
mg max vg 624
wikipedia org wiki 596
evic vtc mini 559
# define a nice color palette
pal <- brewer.pal(8,"Dark2")

# word clouds: up to 150 most common words in each trigram position (word3, word2, word1)
trigrams_filtered %>%
  count(word3) %>%
  with(wordcloud(word3, n, scale=c(4,.1), min.freq=50, max.words = 150,
                 random.order = FALSE, rot.per=.15, colors=pal))

trigrams_filtered %>%
  count(word2) %>%
  with(wordcloud(word2, n, scale=c(4,.1), min.freq=50, max.words=150,
                random.order=FALSE, rot.per=.15,
                colors=pal))

trigrams_filtered %>%
  count(word1) %>%
  with(wordcloud(word1, n, scale=c(4,.1), min.freq=50, max.words = 150,
                 random.order = FALSE, rot.per=.15, colors=pal))

Word and Document Frequency

Word Frequency

#Document(thread) Term Matrix
#Post Term Matrix

word_counts_by_thread_id <- cleaned_tokens %>% 
  group_by(thread_id) %>%
  count(word, sort = TRUE)

review_dtm <- word_counts_by_thread_id %>% 
  cast_dtm(thread_id, word, n)

review_dtm
## <<DocumentTermMatrix (documents: 1087, terms: 17682)>>
## Non-/sparse entries: 1255964/17964370
## Sparsity           : 93%
## Maximal term length: 49
## Weighting          : term frequency (tf)
#tf-idf
tfidf <- word_counts_by_thread_id %>% 
  bind_tf_idf(word, thread_id, n) 

top_tfidf <- tfidf %>%
  arrange(desc(tf_idf))

kable(head(top_tfidf), caption = "Term Frequency by Thread") %>%
  kable_styling(font_size = 9)
Term Frequency by Thread
thread_id word n tf idf tf_idf
http://www.reddit.com/r/electronic_cigarette/comments/41qenv/we_have_officially_been_open_for_an_entire_year/?ref=search_posts vapeaversary 418 0.3866790 6.298030 2.4353158
http://www.reddit.com/r/electronic_cigarette/comments/4wc97f/blue_dot_vapors_giveaway_sale_news_release_of_6/?ref=search_posts nfda 459 0.2372093 6.298030 1.4939512
http://www.reddit.com/r/electronic_cigarette/comments/29i7n7/vaporhop_50_bottle_750_ml_itaste_kit_giveaway_37/?ref=search_posts vaporhop 321 0.1244668 6.298030 0.7838959
http://www.reddit.com/r/electronic_cigarette/comments/3ek576/new_bot_pretends_its_you_based_on_comment_history/?ref=search_posts user_simulator 385 0.1221059 6.298030 0.7690268
http://www.reddit.com/r/electronic_cigarette/comments/3jmxt5/labor_day_sale_and_giveaway_35_off_everything_in/?ref=search_posts sicilian 239 0.1151252 6.298030 0.7250622
http://www.reddit.com/r/vapeitforward/comments/1wiaq3/reddit_raffle_nemmy_clone_extras_giveaway/?ref=search_posts nemmy 487 0.1513833 4.100805 0.6207933

Word Frequency by Subreddit

tf_idf <- word_by_subreddit %>%
  bind_tf_idf(word, subreddit, n) %>%
  arrange(desc(tf_idf))

tf_idf %>%
  filter(str_detect(subreddit, "^vap|^Vap|^ele|^sci|^teen|^ecig|^Inn")) %>%
  group_by(subreddit) %>%
  top_n(12, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = subreddit)) +
  geom_col(show.legend = FALSE) +
  facet_wrap( ~ subreddit, scale = "free") +
  ylab("tf-idf") +
  coord_flip() +
  labs(title = "Term Frequency By Subreddit")

subreddit_cors <- word_by_subreddit %>%
  pairwise_cor(subreddit, word, n, sort = TRUE)

head(subreddit_cors)
## # A tibble: 6 x 3
##   item1         item2         correlation
##   <chr>         <chr>               <dbl>
## 1 worldnews     news                0.972
## 2 news          worldnews           0.972
## 3 technology    worldnews           0.948
## 4 worldnews     technology          0.948
## 5 unitedkingdom worldnews           0.948
## 6 worldnews     unitedkingdom       0.948
set.seed(1231)

subreddit_cors %>%
  filter(correlation > .90) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 6, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

Topic Modeling

lda5 <- LDA(review_dtm, k = 5, control = list(seed = 1234)) 

topics <- terms(lda5, 100)
kable(head(topics)) %>%
  kable_styling()
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
juice like nicotine just thanks
com just people like giveaway
thanks one can can love
mini can just get good
name get smoking vaping awesome
smok battery like one thank
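
The table above lists only the top terms. To see how strongly each term is weighted within a topic, the fitted model can be tidied into per-topic word probabilities; a sketch, assuming a tidytext version that provides tidy() for LDA objects along with reorder_within() and scale_x_reordered():

# per-topic word probabilities (beta) for the fitted 5-topic model
lda5_beta <- tidy(lda5, matrix = "beta")

lda5_beta %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_x_reordered() +
  coord_flip() +
  labs(title = "Top Terms per Topic (beta)")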

NLP result summary and discussion

Sentiment Analysis
We analyzed sentiment by word and by subreddit and found that the majority of sentiment-bearing words in electronic cigarette posts and comments were positive. Users favored words like “good”, “awesome”, “happy”, and “love”, and most subreddits also had an overall positive sentiment.

Relationships Between Words
Bigram - We noted strong relationships between word pairs such as “quit smoking”, “awesome giveaway”, and “started vaping”. More analysis is needed to understand the reasons and motivations of those who wish to quit. We would also like to understand the role of giveaways in encouraging young smokers to start smoking or vaping.

Trigram - The trigram analysis did not reveal any significant trend related to smoking or e-cigarettes. Additional analysis is required.

Word and Document Frequency - Our analysis revealed that the words used most frequently in discussions were “juice”, “vaping”, and “giveaway”. Young e-cigarette users are attracted to flavored products, or “juice”. Additional analysis is needed to understand how these products are discussed. The frequency of the word “giveaway” also suggests that free products are often offered as an enticement for users to begin smoking.

Topic Modeling - The results from the topic models are consistent with the other analyses. All topics contain similar e-cigarette-related words.