Research motivation: Convenience stores play a central role in modern life; from three meals a day to bill payments, they offer remarkably attentive service. As of July 2020, 7-11 ran 5,915 stores in Taiwan and FamilyMart (全家) 3,617, yet FamilyMart's overall share of voice keeps hitting new highs: its membership app's point collection, its soft-serve ice cream, and its roasted sweet potatoes have all given consumers something fresh, while 7-11, drawing on 40 years of chain-store experience, has countered with numerous co-branded campaigns and themed concept stores to generate buzz. What strategy has let FamilyMart stand shoulder to shoulder with industry leader 7-11, and which differences or distinctive features of the two chains shape consumers' purchasing behavior? These are the questions this report explores in depth.
Research design:
Topics discussed in 7-11 posts
Topics discussed in FamilyMart posts
Topic classification of the combined 7-11 and FamilyMart corpus
Network structure and opinion leaders across all posts
Data source: College of Management text analysis platform, PTT CVS board, 10,930 articles, in Chinese
Date range: 2015.12.31-2020.12.31
Keywords: 全家 (FamilyMart), 7-11
System locale setting
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # avoid garbled Chinese output
## [1] ""
Install the required packages
packages = c("readr", "dplyr", "jiebaR", "tidyr", "tidytext", "igraph", "topicmodels", "ggplot2", "stringr", "wordcloud2", "scales", "data.table", "NLP", "ggraph", "reshape2", "widyr", "magrittr", "yaml")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
Load packages
library(readr)
library(dplyr)
library(jiebaR)
library(tidyr)
library(tidytext)
library(igraph)
library(topicmodels)
library(stringr)
library(ggplot2)
library(reshape2)
library(wordcloud2)
library(data.table)
library(purrr)
library(plotly)
library(scales)
library(NLP)
library(ggraph)
library(widyr)
library(magrittr)
library(tm)
library(LDAvis)
library(slam)
library(RColorBrewer)
mycolors <- colorRampPalette(brewer.pal(8, "Set3"))(20)
Load the raw data
posts = fread("articleMetaData.csv", encoding = 'UTF-8')      # article contents
reviews = fread("articleReviews.csv", encoding = 'UTF-8')     # comments
MetaData = fread("./articleMetaData.csv", encoding = 'UTF-8') # article contents (copy for the FamilyMart subset)
Reviews = fread("./articleReviews.csv", encoding = 'UTF-8')   # comments (copy for the FamilyMart subset)
keywords_f = c('全家','family')
toMatch_f = paste(keywords_f,collapse="|")
MetaData_f = with(MetaData, MetaData[grepl(toMatch_f,sentence)|grepl(toMatch_f,artTitle),])
Reviews_f = left_join(MetaData_f, Reviews[,c("artUrl", "cmtContent")], by = "artUrl")
head(posts)
## artTitle artDate artTime
## 1: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 2: [問題]7-11購物金i贈禮擷圖使用不行? 2018/01/01 01:46:44
## 3: [商品]7-11-吉士豬肉堡加蛋新上市 2018/01/01 11:59:30
## 4: [商品]7-11御選紅燒牛肉麵半筋半肉 2018/01/01 19:38:59
## 5: [新聞]制服底下的秘密!全家店員的「白襯衫」 2018/01/01 20:33:28
## 6: [情報]7-11杜卡迪集點活動限量三環錶 2018/01/01 21:37:40
## artUrl artPoster artCat
## 1: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 2: https://www.ptt.cc/bbs/CVS/M.1514800367.A.BB2.html PAYPASS CVS
## 3: https://www.ptt.cc/bbs/CVS/M.1514837158.A.68D.html vhygdih CVS
## 4: https://www.ptt.cc/bbs/CVS/M.1514864701.A.950.html hoyumi CVS
## 5: https://www.ptt.cc/bbs/CVS/M.1514867972.A.80F.html TWOOOOOOOOOO CVS
## 6: https://www.ptt.cc/bbs/CVS/M.1514871823.A.06A.html whiterose512 CVS
## commentNum push boo
## 1: 10 2 0
## 2: 36 11 0
## 3: 19 10 3
## 4: 7 5 0
## 5: 17 10 4
## 6: 11 8 1
## sentence
## 1: 如題\n去7-11繳手機費\n明明沒超過繳費日期\n但是店員掃帳單條碼的時候\n就是掃不過\n怎麼會這樣?\n有人有這種經驗嗎?\n
## 2: 最近國泰優惠app刷了一堆小額5元的購物金\n\n想說怕切換耗時所以都先擷好圖放相簿\n\n好讓店員可以快速刷條碼避免等待\n\n只是那個掃條碼的機器真的不給力\n\n螢幕調到最亮了還是很難掃\n\n也試過調整掃瞄器的角度還是一樣\n\n\n在掃時也忘了先鎖定螢幕,以致店員橫向圖片縮小\n\n看到這是擷圖,就說你這是擷圖不能用\n\n用了到時有問題害我會被罵!\n\n\n我反問他,那你要取消嗎?我可以開app一個一個讓你刷\n\n但時間就會拉長,你要嗎?\n\n這時卻又不回應的繼續刷了\n\n之前openpoint換東西我也這用,蠻方便的\n\n7-11有規定不能這樣用嗎?如果有我也是很尊重的下次開app囉\n\n\n\n另也試過去ibon補印,但那鍵盤a~z照順序排列方式實在很糟\n\n相較famiport的鍵盤排列就比較適合有在打鍵盤的人\n\n但對沒在打鍵盤的人就有選字障礙\n\n這部份二家應該可以做切換選擇讓人擇優使用,是吧\n
## 3: (如有不清楚的項目,可跳過,或按 Ctrl+Y 刪除)\n【商品名稱/價格】:吉士豬肉堡加蛋 $45\n(試吃試用品請標註0元)\n【便利商店/廠商名稱】:7-11\n\n【規格/內容物/熱量】:吉士+豬肉+堡+蛋 /296 卡\n\n【評分】:75\n(未滿60分為不推薦)\n【心得】:\n\n\n好讀圖文連結:\nhttp://vhygdih0412.pixnet.net/blog/post/345462463\n7-11 吉士豬肉堡加蛋新上市 | 296 Cal | $45\nhttps://i.imgur.com/lD9o5n5.jpg\n我自己很喜歡麥當勞的豬肉滿福堡\n超商其實過去幾年都有模仿滿福堡系列\n這次看到架子上面的新款不知道是什麼時候改的\n不過看起來跟幾個月前吃的好像有點不一樣就買來試看看\nhttps://i.imgur.com/lfRzQkd.jpg\n▲國中化學有好好學嗎?XDD\n這一份卡路里296還算蠻低的\n納含量也只有滿福堡的一半\n理論上會比較''健康''\n本來女友搓了個羊毛氈\n想送給我當以後的開箱小伙伴\n....但\nhttps://i.imgur.com/iQCdnt1.jpg\n▲這是萌萌......有87分像吧XD\n反正為了自身性命著想先用了哈哈\nhttps://i.imgur.com/jPcfDRL.jpg\n▲萌萌正面還是蠻可愛的\n就耳朵大了點...XDD\nhttps://i.imgur.com/bgWqHon.jpg\n▲回到介紹這個新款的產品\n就這樣看過去還蠻像麥當勞款的哈哈\nhttps://i.imgur.com/PDgQKMb.jpg\n▲這次使用的蛋是模具煎出來的蛋\n以前的款式使用是有攪拌過的蛋液\n口感上更接近速食店\n雞蛋是石安牧場\n另外麵包還是塗上了美乃滋\n本來期待會改成奶油...\nhttps://i.imgur.com/WqyHgRr.jpg\n▲起司的擺放位置也有調整\n現在是放在蛋跟肉片之間\n可以看得出來肉片還蠻厚的\n口感跟之前吃起來也是不一樣\nhttps://i.imgur.com/yql309a.jpg\n▲產品的剖面圖\n這樣子看起來又跟麥當勞的很像哈哈\n這次新款的吉士豬肉堡加蛋\n除了改了雞蛋的樣式之外\n肉片筋跟肥肉的部分也比較多\n可是肉片應該也是用燙的不是用煎\n而且沒什麼調味味道偏淡\n不過比之前的款式有比較好吃一點了\n對腎臟負擔也不會這麼重XD\n如果晚上突然想吃的話還是會再買來吃\n但還是最期待麥當勞豬肉滿福堡加蛋全天供應啊\n
## 4: 【商品名稱/價格】:御選紅燒牛肉麵/129元\n\n【便利商店/廠商名稱】:7-11\n\n【規格/內容物/熱量】:\nhttps://i.imgur.com/Z2PYN3k.jpg\n【評分】:65分\n\n【心得】:\n\n2018年第一碗盤子專用商品 在小七發現到這個不一樣的牛肉麵\n\n區域限定 還很有小七風格 加了'御選'兩個字\nhttps://i.imgur.com/HnmcwSg.jpg\n微波後有點霧化 如果在店內看到 馬上就很明顯看出料真的很少\n\n(以前三明治還會在表面偽裝一下 現在直接讓你看出真的料很少)\n\n但還是抖M的買來吃看看\nhttps://i.imgur.com/Z75cKUw.jpg\n內容樣子\n\n上層會有一層很濃厚的油脂\nhttps://i.imgur.com/TUVtljk.jpg\n麵條是寬麵條 以超商來說還算不錯吃\nhttps://i.imgur.com/802IgNP.jpg\n肉也是少少的 就之前吃過的小七牛肉麵的肉(廢話)\nhttps://i.imgur.com/ncZSa9c.jpg\n也有些牛筋 其實我覺得還不錯\nhttps://i.imgur.com/YqnMV5D.jpg\n偶爾也會有牛肉帶筋的這種\n\n以超商牛肉麵來說 不要看價錢是還不錯\n\n只是這個價錢跟料 實在是很難給高分阿...\n\n
## 5: 制服底下的秘密!全家店員的「白襯衫」 讓網友笑噴瘋團購\n\n\n\n\n2018-01-02 12:13聯合新聞網 綜合報導\n\n\n萬能的超商店員每天穿著整齊的制服替大小顧客服務,但近日有人公開了他們制服底下的秘\n密,內裡的白襯衫竟然只有半截,讓不少看了笑說「太猛了,無極限」。\n\n\n一名男大生在\n發文表示,到全家超商拜訪昔日的室友,在店內的他原本看起來「\n衣冠整齊,沒有什麼不對勁」,但沒想到把外面制服脫下後,發現裡面的「白襯衫」,竟只\n是一個「假衣領」,制服底下的秘密完全現形,男大生還笑說「這個是肚兜、肚兜、肚兜!\n!!我看了直接傻眼」。\n\n\n原來是全家制服下規定要穿白襯衫打領帶,雖然美觀但活動也較為不便,才導致大家出遍奇\n招而衍生出「假衣領」這種穿法。不少網友看完直說太有創意,還有人揪服務業一起團購這\n個「假衣領」,也釣出不少服務業網友說,上班穿襯衫真的很不方便,戴上假衣領方便許多\n;還有人吐槽說「這不是肚兜啦,要也是圍兜兜」。\nhttps://udn.com/news/story/8864/2907185?from=udn-ch1_breaknews-1-0-news\nhttps://i.imgur.com/2sCUQ3j.jpg\n網友公開超商店員制服底下的秘密,讓不少人笑歪。圖/翻攝自
## 6: 【活動名稱】:限量三環錶+Dovi親簽安全帽、防風外套\n\n【便利商店/廠商名稱】:7-11\n\n【活動時間】:\n\n登記時間:2018/01/03 15:00~ 2018/01/16 23:59\n\n兌換時間:2018/01/18~ 2018/01/19 12:00~14:00\n\n【活動內容】:\n這波7-11的集點活動除了4點+99元就能換一款杜卡迪的模型外\nhttps://i.imgur.com/QJx3mUk.jpg\n居然還可以免費換到杜卡迪三環計時腕錶\n跟DOVI親筆簽名安全帽、防風外套!!!(我真的很想要這個阿,拿來當傳家寶也hen可以)\nhttps://i.imgur.com/bShtaPT.jpg\n但限量是殘酷的,也要搶到號碼牌才可以換\n1/3的3:00pm開放網路登記(雖然那時候在上班,但哥沒在怕的啦)\n之前搶五月天演唱會的門票好幾次都沒有搶到(甚至開始懷疑自己的人品)\n這次說什麼都要搶好搶滿\n\n活動連結:\nhttps://www.7-11pointgift.com.tw/index.aspx?EID=000002\n【注意事項】:\n
head(reviews)
## artTitle artDate artTime
## 1: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 2: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 3: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 4: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 5: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## 6: [討論]去7-11繳費繳不了 2017/12/31 21:09:34
## artUrl artPoster artCat
## 1: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 2: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 3: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 4: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 5: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## 6: https://www.ptt.cc/bbs/CVS/M.1514783736.A.CFC.html johnny85 CVS
## cmtPoster cmtStatus cmtDate
## 1: bxxl 推 2018-01-01 05:12:00
## 2: bxxl → 2018-01-01 05:13:00
## 3: johnny85 → 2018-01-01 05:28:00
## 4: fliersky → 2018-01-01 08:39:00
## 5: fliersky → 2018-01-01 08:39:00
## 6: johnny85 → 2018-01-01 08:57:00
## cmtContent
## 1: :是沒掃進去?還是掃進去後系統不通過?
## 2: :前者就螢幕擦一擦調高亮度,後者沒遇過,換別家小七或全家試試
## 3: :是沒掃進去照b大的方式再去試試看
## 4: :帳單是要印出來的嗎?注意事項不都會寫?有些印出來才
## 5: :能掃
## 6: :不是要印出來的
Remove the boilerplate strings PTT inserts when news articles are reposted
posts <- posts %>%
mutate(sentence=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除|【商品名稱/價格】|【活動名稱】|【便利商店/廠商名稱】|【規格/內容物/熱量】|【評分】|【活動時間】", "", sentence))
reviews <- reviews %>%
  mutate(cmtContent=gsub("媒體來源|記者署名|完整新聞標題|完整新聞內文|完整新聞連結|(或短網址)|備註|備註請放最後面|違者新聞文章刪除|【商品名稱/價格】|【活動名稱】|【便利商店/廠商名稱】|【規格/內容物/熱量】|【評分】|【活動時間】", "", cmtContent))
Build a separate data.frame for the 7-11 posts
# 7-11
keywords_7 = c('7-11','小7','711')
toMatch_7 = paste(keywords_7, collapse="|")
MetaData_7 = with(posts, posts[grepl(toMatch_7, sentence) | grepl(toMatch_7, artTitle),]) # could be filtered further
Reviews_7 = left_join(MetaData_7, reviews[,c("artUrl", "cmtContent")], by = "artUrl") # comments
# Split articles into sentences (replace "\n\n" with "。")
#meta_7 <- MetaData_7 %>%
#  mutate(sentence=gsub("[\n]{2,}", "。", sentence))
# Split sentences on full- or half-width exclamation marks, question marks,
# semicolons, and the full-width period
# sentences_7 <- strsplit(meta_7$sentence,"[。!;?!?;]")
# Pair each sentence with the URL of its source article in a data frame
# sentences_7 <- data.frame(
#   artUrl = rep(meta_7$artUrl, sapply(sentences_7, length)),
#   sentence = unlist(sentences_7)
# ) %>%
#   filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
# Drop rows consisting only of \t, \n, or spaces
# sentences_7$sentence <- as.character(sentences_7$sentence)
# sentences_7
Set up the tokenizer
# Tokenize the articles
# Load lexicon_7 (user dictionary of terms that must not be split)
#lexicon_7 <- scan(file = "D:/W16_finalnew/dict/lexicon_7.txt",
#                  what=character(), sep='\n',
#                  encoding='utf-8', fileEncoding='utf-8')
# Load stop words
#stopwords_7 <- scan(file = "D:/W16_finalnew/dict/stopwords_7.txt",
#                    what=character(), sep='\n',
#                    encoding='utf-8', fileEncoding='utf-8')
# Initialize a tokenizer with default parameters
# jieba_tokenizer = worker()
# Add the custom dictionary to the tokenizer
# new_user_word(jieba_tokenizer, c(lexicon_7))
# # tokenize function
# tokenizer_7 <- function(t) {
#   lapply(t, function(x) {
#     if(nchar(x)>1){
#       tokens_7 <- segment(x, jieba_tokenizer)
#       tokens_7 <- tokens_7[!tokens_7 %in% stopwords_7]
#       # drop tokens of length 1
#       tokens_7 <- tokens_7[nchar(tokens_7)>1]
#       return(tokens_7)
#     }
#   })
# }
## Tokenize the sentences with the tokenizer initialized above
# tokens_7 <- sentences_7 %>%
#   mutate(sentence = gsub("[[:punct:]]", "", sentence)) %>%
#   mutate(sentence = gsub("[0-9a-zA-Z]", "", sentence)) %>%
#   unnest_tokens(word, sentence, token=tokenizer_7) %>%
#   count(artUrl, word) %>%   # per-article word frequencies
#   rename(count=n)
# tokens
#save.image(file = "D:/W16_finalnew/token_result/token_result_7.rdata")
7-11 tokenization results
load("token_result/token_result_7.rdata")freq = 3
# Keep words whose corpus frequency exceeds freq
reserved_word_7 <- tokens_7 %>%
  group_by(word) %>%
  count() %>%
  filter(n > freq) %>%
  pull(word)
removed_7 <- tokens_7 %>%
  filter(word %in% reserved_word_7)
# dtm_7: nrow = number of documents; ncol = number of distinct terms
dtm_7 <- removed_7 %>% cast_dtm(artUrl, word, count)
#ldas_7 = c()
#topics_7 = c(2,4,6,10,15)
#for(topic in topics_7){
#  start_time <- Sys.time()
#  lda_7 <- LDA(dtm_7, k = topic, control = list(seed = 2021))
#  ldas_7 = c(ldas_7, lda_7)
#  print(paste(topic, paste("topic(s) and use time is ", Sys.time() - start_time)))
#  save(ldas_7, file = "D:/W16_finalnew/ldas_result/ldas_result_7.rdata") # save the models to disk
#}
7-11 LDA topic models
load("ldas_result/ldas_result_7.rdata")
Find the best number of topics via perplexity
topics = c(2,4,6,10,15)
tibble(k = topics, perplex = map_dbl(ldas_7, topicmodels::perplexity)) %>%
  ggplot(aes(k, perplex)) +
  geom_point() +
  geom_line() +
  labs(title = "Evaluating LDA topic models",
       subtitle = "Optimal number of topics (smaller is better)",
       x = "Number of topics",
       y = "Perplexity")
The plot shows that perplexity is lower for topic counts above four and below eight.
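As a quick numeric cross-check of the elbow plot, the same values can be read off directly; a minimal sketch, assuming ldas_7 holds the five fitted models in the same order as topics:
# Named vector of perplexities, one entry per candidate topic count
setNames(map_dbl(ldas_7, topicmodels::perplexity), topics)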
Plot the LDAvis model
library(text2vec)
## Warning: package 'text2vec' was built under R version 4.0.5
##
## Attaching package: 'text2vec'
## The following object is masked from 'package:topicmodels':
##
## perplexity
## The following object is masked from 'package:igraph':
##
## normalize
library(udpipe)
## Warning: package 'udpipe' was built under R version 4.0.5
dtf_7 <- document_term_frequencies(tokens_7, document = "artUrl", term = "word")
dtm_7 <- document_term_matrix(x = dtf_7)
dtm_clean_7 <- dtm_remove_lowfreq(dtm_7, minfreq = 30)
dim(dtm_clean_7)
## [1] 4314 1228
set.seed(2019)
topic_n_7 = 3
lda_model_7 =text2vec::LDA$new(n_topics = topic_n_7,doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr_7 = lda_model_7$fit_transform(dtm_clean_7, n_iter = 1000, convergence_tol = 1e-5, check_convergence_every_n = 100)
## INFO [17:38:46.645] early stopping at 200 iteration
## INFO [17:38:47.353] early stopping at 50 iteration
lda_model_7$get_top_words(n = 10, lambda = 0.5) # top 10 words per topic
## [,1] [,2] [,3]
## [1,] "心得" "活動" "試吃"
## [2,] "起來" "店員" "試用品"
## [3,] "味道" "請問" "可跳過"
## [4,] "覺得" "謝謝" "刪除"
## [5,] "口感" "門市" "如有"
## [6,] "不會" "注意事項" "項目"
## [7,] "不錯" "附近" "清楚"
## [8,] "比較" "名稱" "滿分"
## [9,] "看到" "全家" "標註"
## [10,] "這款" "使用" "推薦"
#lda_model_7$plot()
# lda_model_7$plot(out.dir = "lda_result", open.browser = TRUE)
The LDAvis view shows the topics barely overlap: only the larger central topic slightly covers the others.
# Fit LDA with 3 topics
lda_7 <- LDA(dtm_7, k = 3, control = list(seed = 123))
Find each topic's representative words
removed_word_7 = c("不是","每天","出來","覺得", "這次", "還是", "刪除", "心得")
# Common words in each topic
tidy(lda_7, matrix = "beta") %>%   # topic-term beta values
  filter(! term %in% removed_word_7) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%   # top 10 words by beta
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
  scale_x_reordered()
Assign each article its dominant topic: the topic with the largest gamma value.
# Use matrix="gamma" in the tidy function to obtain the theta matrix
topics_7 <- tidy(lda_7, matrix="gamma") %>%   # document-topic gamma
  group_by(document) %>%
  top_n(1, wt=gamma)
topics_7
## # A tibble: 4,315 x 3
## # Groups: document [4,315]
## document topic gamma
## <chr> <int> <dbl>
## 1 https://www.ptt.cc/bbs/CVS/M.1451661374.A.64F.html 1 0.343
## 2 https://www.ptt.cc/bbs/CVS/M.1451968352.A.A17.html 1 0.335
## 3 https://www.ptt.cc/bbs/CVS/M.1452269826.A.568.html 1 0.336
## 4 https://www.ptt.cc/bbs/CVS/M.1452417742.A.F0D.html 1 0.335
## 5 https://www.ptt.cc/bbs/CVS/M.1452655607.A.EBF.html 1 0.345
## 6 https://www.ptt.cc/bbs/CVS/M.1452685957.A.A3B.html 1 0.411
## 7 https://www.ptt.cc/bbs/CVS/M.1452692377.A.1BF.html 1 0.340
## 8 https://www.ptt.cc/bbs/CVS/M.1453045186.A.162.html 1 0.351
## 9 https://www.ptt.cc/bbs/CVS/M.1453054347.A.ED2.html 1 0.345
## 10 https://www.ptt.cc/bbs/CVS/M.1453181617.A.62F.html 1 0.372
## # ... with 4,305 more rows
Exploring the data
posts_topic_7 <- merge(x = MetaData_7, y = topics_7, by.x = "artUrl", by.y="document")
# Sample titles to see what each topic is about
set.seed(123)
posts_topic_7 %>% # topic 1
  filter(topic==1) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(10)
## artTitle
## 1: [商品]7-11zyliss玩色刀具
## 2: [商品]全家究極牛奶泡芙
## 3: [商品]全家麻辣牛肉鍋新上市
## 4: [商品]7-11韓式風味炸雞&辣味炸雞球
## 5: [商品]7-11蜜汁烤雞玉子三明治新上市
## 6: [情報]全家「玩具總動員」集點活動(7/31起)
## 7: [資訊]7-11領口罩─用ibon的LINE官方帳號出條碼
## 8: [商品]7-11椒麻雞翅
## 9: [商品]7-1121風味館手撕雞鮮蔬沙拉
## 10: [商品]7-11泰奶雪糕/摩摩喳喳雪糕
posts_topic_7 %>% # topic 2
  filter(topic==2) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(10)
## artTitle
## 1: [商品]7-11初鹿牧場草莓濃奶茶
## 2: [問題]7-11秋季進口啤酒
## 3: [商品]7-11台南風味意麵
## 4: [商品]7-11福樂鮮攪海鹽焦糖奶茶
## 5: [商品]20207-11399福袋
## 6: [商品]7-11泰式打拋雞御飯糰
## 7: [商品]7-11香蒜白酒蛤蜊義大利
## 8: [商品]7-11七七黑巧杏仁乳加
## 9: [問題]7-11樂事富貴蹄膀口味
## 10: [商品]7-11泰式紅咖哩飯
posts_topic_7 %>% # topic 3
  filter(topic==3) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(10)
## artTitle
## 1: [商品]全家辣莎莎烤雞歐姆蛋燴飯
## 2: [商品]全家黃金排骨炒飯新上市
## 3: [商品]7-11大杯緣子熱血分享
## 4: [商品]7-11乳酪地瓜燒
## 5: [商品]7-11秋雅蘋果梅子醋
## 6: [商品]7-11櫻花風草莓歐蕾
## 7: [商品]7-11日式炒烏龍麵
## 8: Re:[討論]7-11兌換統一發票獎金存icash送卡
## 9: [商品]7-11FeedFit輕享冰淇淋—大人可可
## 10: [商品]7-11雙層牛肉堡試吃心得
We can summarize the topics as:
topic 1 = "new 7-11 products and product questions"
topic 2 = "product prices and loyalty points"
topic 3 = "7-11 services, lucky bags, and co-branded campaigns"
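These labels can be attached to the per-article topic assignments for downstream counts; a small sketch (the label strings are our own interpretation, not model output):
# Map topic ids to the names proposed above
topic_names_7 <- c("1" = "711新出商品及問題",
                   "2" = "商品售價及點數",
                   "3" = "711服務項目及福袋聯名")
topics_7 %>%
  ungroup() %>%
  mutate(topic_name = topic_names_7[as.character(topic)]) %>%
  count(topic_name)   # number of articles per labeled topic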
Count articles per topic
posts_topic_7 %>%
group_by(artCat,topic) %>%
summarise(sum = n()) %>%
ggplot(aes(x= artCat,y=sum,fill=as.factor(topic))) +
geom_col(position="dodge") ## `summarise()` has grouped output by 'artCat'. You can override using the `.groups` argument.
MetaData_f %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate) %>%
summarise(count = n())%>%
ggplot(aes(artDate,count))+
geom_line(color="red")+
  geom_point()
The figure shows several days with unusually high discussion volume; on inspection, those posts all concern FamilyMart campaigns, which also influences the LDA results that follow.
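To pinpoint those spike days, count the posts per day and sort; a quick check, not part of the original pipeline:
# Busiest days in the FamilyMart subset
MetaData_f %>%
  mutate(artDate = as.Date(artDate)) %>%
  group_by(artDate) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10)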
load("./token_result.rdata")freq = 2
# Keep words whose corpus frequency exceeds freq
reserved_word <- tokens %>%
  group_by(word) %>%
  count() %>%
  filter(n > freq) %>%
  pull(word)
mask_removed <- tokens %>%
  filter(word %in% reserved_word)
# mask_dtm: nrow = number of documents; ncol = number of distinct terms
mask_dtm <- mask_removed %>% cast_dtm(artUrl, word, count)
tokens <- MetaData_f %>%
  unnest_tokens(word, sentence, token=chi_tokenizer) %>%
  filter(!str_detect(word, regex("[0-9a-zA-Z]")) | str_detect(word, regex("[Aa][Zz]")))
dtf <- document_term_frequencies(tokens, document = "artUrl", term = "word")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 30)
dim(dtm_clean)
set.seed(2021)
topic_n = 4
lda_model =text2vec::LDA$new(n_topics = topic_n,doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr =lda_model$fit_transform(dtm_clean, n_iter = 1000, convergence_tol = 1e-5,check_convergence_every_n = 100)
lda_model$get_top_words(n = 10, lambda = 0.5) # top 10 words per topic
lda_model$plot()
# lda_model$plot(out.dir = "lda_result", open.browser = TRUE)
(figure: result_for_4, LDAvis with 4 topics)
We first build the LDAvis view with 4 topics; topics 1 and 2 overlap, so 4 topics is not a good choice.
tokens <- MetaData_f %>%
unnest_tokens(word, sentence, token=chi_tokenizer) %>%
filter(!str_detect(word, regex("[0-9a-zA-Z]"))| str_detect(word, regex("[Aa][Zz]")))
dtf <- document_term_frequencies(tokens, document = "artUrl", term = "word")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 30)
dim(dtm_clean)
set.seed(2021)
topic_n = 3
lda_model =text2vec::LDA$new(n_topics = topic_n,doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr =lda_model$fit_transform(dtm_clean, n_iter = 1000, convergence_tol = 1e-5,check_convergence_every_n = 100)
lda_model$get_top_words(n = 10, lambda = 0.5) # top 10 words per topic
lda_model$plot()
# lda_model$plot(out.dir = "lda_result", open.browser = TRUE)
(figure: result_for_3, LDAvis with 3 topics)
Rebuilding the LDAvis view with 3 topics, no topics overlap, so 3 is the more suitable number of LDA topics.
List each topic's top 10 terms, sorted by phi in descending order.
# Fit LDA with 3 topics
mask_lda <- LDA(mask_dtm, k = 3, control = list(seed = 2021))
removed_word = c("全家","便利商店","價格","廠商","心得","沒有","內容","可以","商品名稱","規格","熱量","覺得")
# Common words in each topic
tidy(mask_lda, matrix = "beta") %>%   # topic-term beta values
  filter(! term %in% removed_word) %>%
  group_by(topic) %>%
  top_n(10, beta) %>%   # top 10 words by beta
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
  scale_x_reordered()
From these results, topic 1 mostly concerns campaigns and events, topic 2 focuses on food products, and topic 3 reflects FamilyMart's overall image: staff, campaigns, and food together.
topics_name = c("活動討論","食品討論","整體討論") # campaign / food / overall discussion
# Use matrix="gamma" in the tidy function to obtain the theta matrix
mask_topics <- tidy(mask_lda, matrix="gamma") %>% # document topic gamma
group_by(document) %>%
top_n(1, wt=gamma)
mask_topics
## # A tibble: 7,331 x 3
## # Groups: document [7,331]
## document topic gamma
## <chr> <int> <dbl>
## 1 https://www.ptt.cc/bbs/CVS/M.1451661466.A.A21.html 1 0.339
## 2 https://www.ptt.cc/bbs/CVS/M.1451750974.A.F67.html 1 0.342
## 3 https://www.ptt.cc/bbs/CVS/M.1451875948.A.D8E.html 1 0.340
## 4 https://www.ptt.cc/bbs/CVS/M.1451923047.A.545.html 1 0.345
## 5 https://www.ptt.cc/bbs/CVS/M.1451988141.A.03C.html 1 0.336
## 6 https://www.ptt.cc/bbs/CVS/M.1452067191.A.AE0.html 1 0.344
## 7 https://www.ptt.cc/bbs/CVS/M.1452134431.A.7A7.html 1 0.343
## 8 https://www.ptt.cc/bbs/CVS/M.1452252782.A.E04.html 1 0.350
## 9 https://www.ptt.cc/bbs/CVS/M.1452597707.A.49E.html 1 0.347
## 10 https://www.ptt.cc/bbs/CVS/M.1452620790.A.E0C.html 1 0.353
## # ... with 7,321 more rows
posts_topic <- merge(x = MetaData_f, y = mask_topics, by.x = "artUrl", by.y="document")
posts_topic %>%
  mutate(artDate = as.Date(artDate)) %>%
  group_by(artDate=format(artDate,'%Y%m'), topic) %>%
  summarise(sum = n()) %>%   # count articles per month and topic
  ggplot(aes(x= artDate, y=sum, fill=as.factor(topic))) +
  geom_col(position="fill")
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
Looking at the topic mix over time, topic 3 is discussed most often; but as noted above, topic 1 (campaign discussion) raises more engaging issues because it covers FamilyMart's various campaigns, so we explore topic 1 further.
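The same monthly mix can be cross-checked as raw counts rather than stacked proportions; a sketch reusing posts_topic as built above:
# One row per month, one column per topic, cell = article count
posts_topic %>%
  mutate(month = format(as.Date(artDate), "%Y%m")) %>%
  count(month, topic) %>%
  pivot_wider(names_from = topic, values_from = n, values_fill = 0) %>%
  head()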
posts_topic <- merge(x = MetaData_f, y = mask_topics, by.x = "artUrl", by.y="document")
# Sample titles to see what topic 1 is about
set.seed(2021)
posts_topic %>% # topic 1
  filter(topic==1) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(5)
## artTitle
## 1: [商品]全家萌寵戰力衣
## 2: [商品]全家-義美咖啡牛乳
## 3: [問題]在全家能用台新簽帳金融卡繳費的項目
## 4: [商品]7-11巧酥奶霜白燒
## 5: [商品]全家FMC古法缸釀炸醬麵
Build a word cloud for topic 1
topic1_data = posts_topic %>% # topic 1
  filter(topic==1)
# Tokenizer with a custom user dictionary and stop-word list
jieba_tokenizer <- worker(user = "user_dict.txt", stop_word = "stop_words.txt")
# Tokenizer function
customized_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    return(tokens)
  })
}
MToken_f <- topic1_data %>% unnest_tokens(word, sentence, customized_tokenizer)
# Parse the date column
MToken_f$artDate = MToken_f$artDate %>% as.Date("%Y/%m/%d")
# Filter out special characters
data_select_f = MToken_f %>%
  filter(!grepl('[[:punct:]]', word)) %>%   # drop punctuation
  filter(!grepl("[0-9a-zA-Z]", word)) %>%   # drop English letters and digits
  filter(nchar(.$word)>1)                   # keep words of at least two characters
# Daily frequency of each word
# word_count: artDate, word, count
word_count_f <- data_select_f %>%
  select(artDate, word) %>%
  group_by(artDate, word) %>%
  summarise(count=n()) %>%   # words counted per day
  filter(count>3) %>%        # drop words that appear too rarely
  arrange(desc(count))
## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
word_count_f_2 = subset(word_count_f, !word_count_f$word %in% removed_word)
word_count_f_2 %>%
select(word,count) %>%
group_by(word) %>%
summarise(count = sum(count)) %>%
arrange(desc(count)) %>%
  wordcloud2()
## Adding missing grouping variables: `artDate`
# Split articles into sentences (replace "\n\n" with "。")
#meta_a <- posts %>%
#  mutate(sentence=gsub("[\n]{2,}", "。", sentence))
# Split sentences on full- or half-width exclamation marks, question marks,
# semicolons, and the full-width period
#sentences_a <- strsplit(meta_a$sentence,"[。!;?!?;]")
# Pair each sentence with the URL of its source article in a data frame
#sentences_a <- data.frame(
#  artUrl = rep(meta_a$artUrl, sapply(sentences_a, length)),
#  sentence = unlist(sentences_a)
#  ) %>%
#  filter(!str_detect(sentence, regex("^(\t|\n| )*$")))
# Drop rows consisting only of \t, \n, or spaces
#sentences_a$sentence <- as.character(sentences_a$sentence)
#sentences_a
Tokenize the articles
# Load lexicon_a (user dictionary of terms that must not be split)
#lexicon_a <- scan(file = "dict/con_lexicon_a.txt",
#                  what=character(), sep='\n',
#                  encoding='utf-8', fileEncoding='utf-8')
# Load stop words
#stopwords_a <- scan(file = "dict/stop_words_a.txt",
#                    what=character(), sep='\n',
#                    encoding='utf-8', fileEncoding='utf-8')
# Initialize a tokenizer with default parameters
#jieba_tokenizer = worker()
# Add the custom dictionary to the tokenizer
#new_user_word(jieba_tokenizer, c(lexicon_a))
# tokenize function
# tokenizer_a <- function(t) {
#   lapply(t, function(x) {
#     if(nchar(x)>1){
#       tokens_a <- segment(x, jieba_tokenizer)
#       tokens_a <- tokens_a[!tokens_a %in% stopwords_a]
#       # drop tokens of length 1
#       tokens_a <- tokens_a[nchar(tokens_a)>1]
#       return(tokens_a)
#     }
#   })
# }
## Tokenize the sentences with the tokenizer initialized above
# tokens_a <- sentences_a %>%
#   mutate(sentence = gsub("[[:punct:]]", "", sentence)) %>%
#   mutate(sentence = gsub("[0-9a-zA-Z]", "", sentence)) %>%
#   unnest_tokens(word, sentence, token=tokenizer_a) %>%
#   count(artUrl, word) %>%   # per-article word frequencies
#   rename(count=n)
#tokens_a
#save.image(file = "token_result/token_result_a.rdata")
load("token_result/token_result_a.rdata")
reserved_word <- tokens_a %>%
  group_by(word) %>%
  count() %>%
  filter(n > 3) %>%   # keep words with corpus frequency above 3
  pull(word)
ele_removed <- tokens_a %>%
  filter(word %in% reserved_word)
# ele_dtm: nrow = number of documents; ncol = number of distinct terms
ele_dtm <- ele_removed %>% cast_dtm(artUrl, word, count)
ldas = c()
topics = c(2,4,6,8,12)
for(topic in topics){
  start_time <- Sys.time()
  lda <- LDA(ele_dtm, k = topic, control = list(seed = 2021))
  ldas = c(ldas, lda)
  print(paste(topic, paste("topic(s) and use time is ", Sys.time() - start_time)))
  save(ldas, file = "ldas_result/ldas_result_a.rdata") # save the models to disk
}
Save the LDA topic models
load("ldas_result/ldas_result_a.rdata")透過perplexity找到最佳主題數
topics = c(2,4,6,8,12)
tibble(k = topics, perplex = map_dbl(ldas, topicmodels::perplexity)) %>%
  ggplot(aes(k, perplex)) +
  geom_point() +
  geom_line() +
  labs(title = "Evaluating LDA topic models",
       subtitle = "Optimal number of topics (smaller is better)",
       x = "Number of topics",
       y = "Perplexity")
Plot the LDAvis view
jieba_tokenizer = worker()
news_tokenizer <- function(t) {
  lapply(t, function(x) {
    if(nchar(x)>1){
      tokens <- segment(x, jieba_tokenizer)
      # drop tokens of length 1
      tokens <- tokens[nchar(tokens)>1]
      return(tokens)
    }
  })
}
library(text2vec)
library(udpipe)
dtf <- document_term_frequencies(tokens_a, document = "artUrl", term = "word")
dtm <- document_term_matrix(x = dtf)
dtm_clean <- dtm_remove_lowfreq(dtm, minfreq = 30)
dim(dtm_clean)
## [1] 10930 2706
Fit the LDA model
set.seed(2019)
topic_n = 2
lda_model =text2vec::LDA$new(n_topics = topic_n,doc_topic_prior = 0.1, topic_word_prior = 0.001)
doc_topic_distr = lda_model$fit_transform(dtm_clean, n_iter = 1000, convergence_tol = 1e-5, check_convergence_every_n = 100)
## INFO [17:39:50.857] early stopping at 210 iteration
## INFO [17:39:51.607] early stopping at 20 iteration
LDAvis
lda_model$get_top_words(n = 10, lambda = 0.5) # top 10 words per topic
## [,1] [,2]
## [1,] "全家" "心得"
## [2,] "活動" "推薦"
## [3,] "店員" "滿分"
## [4,] "請問" "試吃"
## [5,] "謝謝" "試用品"
## [6,] "知道" "味道"
## [7,] "注意事項" "覺得"
## [8,] "便利商店" "起來"
## [9,] "大家" "不會"
## [10,] "內容" "好吃"
#lda_model$plot()
#lda_model$plot(out.dir = "lda_result", open.browser = TRUE)
Tentative topic names:
Topic 1: promotions and services. Topic 2: product news.
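These tentative names can be sanity-checked by varying lambda in get_top_words, which trades topic-exclusive relevance against raw frequency; a quick sketch on the fitted text2vec model:
lda_model$get_top_words(n = 10, lambda = 1)    # rank purely by within-topic probability
lda_model$get_top_words(n = 10, lambda = 0.2)  # emphasize words exclusive to each topic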
# Fit LDA with 2 topics
ele_lda <- LDA(ele_dtm, k = 2, control = list(seed = 123))
Representative words
removed_word = c("全家","便利商店","價格","真的","規格","心得")
# Common words in each topic
tidy(ele_lda, matrix = "beta") %>%   # topic-term beta values
  filter(! term %in% removed_word) %>%
  group_by(topic) %>%
  top_n(15, beta) %>%   # top 15 words by beta
ungroup() %>%
mutate(topic = as.factor(topic),
term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = topic)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
  scale_x_reordered()
Assign dominant topics
# Use matrix="gamma" in the tidy function to obtain the theta matrix
ele_topics <- tidy(ele_lda, matrix="gamma") %>%   # document-topic gamma
  group_by(document) %>%
  top_n(1, wt=gamma)
ele_topics
## # A tibble: 10,930 x 3
## # Groups: document [10,930]
## document topic gamma
## <chr> <int> <dbl>
## 1 https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html 1 0.517
## 2 https://www.ptt.cc/bbs/CVS/M.1451638907.A.AAA.html 1 0.541
## 3 https://www.ptt.cc/bbs/CVS/M.1451661374.A.64F.html 1 0.506
## 4 https://www.ptt.cc/bbs/CVS/M.1451661466.A.A21.html 1 0.501
## 5 https://www.ptt.cc/bbs/CVS/M.1451670968.A.0E5.html 1 0.560
## 6 https://www.ptt.cc/bbs/CVS/M.1451750974.A.F67.html 1 0.528
## 7 https://www.ptt.cc/bbs/CVS/M.1451803298.A.1D3.html 1 0.501
## 8 https://www.ptt.cc/bbs/CVS/M.1451875948.A.D8E.html 1 0.500
## 9 https://www.ptt.cc/bbs/CVS/M.1451906071.A.457.html 1 0.526
## 10 https://www.ptt.cc/bbs/CVS/M.1452067191.A.AE0.html 1 0.542
## # ... with 10,920 more rows
Inspect article titles within each topic
posts_topic <- merge(x = posts, y = ele_topics, by.x = "artUrl", by.y="document")
set.seed(123)
posts_topic %>% # topic 1
  filter(topic==1) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(20)
## artTitle
## 1: [商品]全家壽司米蛋糕
## 2: [討論]全家紅豆銅鑼燒消失了
## 3: [商品]7-11中村芋泥蛋糕捲
## 4: [問題]全家酷繽沙冰沙咖啡口味
## 5: [商品]全家香菇蛋黃肉包
## 6: [商品]7-11太和殿麻辣燙
## 7: [商品]全家-苺果穀物脆片優格
## 8: [商品]7-11微醉柑橘沙瓦
## 9: [商品]7-11卡娜赫拉的小動物-Usagiicash2.0
## 10: [商品]7-11芋香麵包
## 11: [問題]7-11「隨時選」結帳疑問
## 12: [商品]全家炙燒培根卷
## 13: [商品]全家義式烤雞奶油義大利麵
## 14: [商品]全家PANTONE生活節叉匙組
## 15: [新聞]全家280元股價罕見飛越龍頭小七 喜憂都
## 16: [商品]全家一日蔬果100%番茄甘蔗汁
## 17: [商品]全家溏心蛋沙拉三明治
## 18: [商品]7-11新感覺雞蛋沙拉
## 19: [商品]7-11優菓甜坊黑糖豆花(浮雲花)/仙草凍(
## 20: [問題]全家涼麵蒜泥醬呢?
set.seed(123)
posts_topic %>% # topic 2
  filter(topic==2) %>%
  select(artTitle) %>%
  unique() %>%
  sample_n(20)
## artTitle
## 1: [問題]7-11集點貼紙問題
## 2: 贈送7-11AB優酪乳250ml
## 3: [商品]全家西西里咖啡
## 4: [商品]全家x馬辣宮廷花雕雞鍋新發售
## 5: [商品]7-11杜老爺X金格蜂蜜蛋糕雪糕
## 6: [商品]全家KITKAT酷繽沙
## 7: [商品]7-11燻雞法式烤土司新上市
## 8: [商品]7-11義美台灣花生巧克力雪糕
## 9: [商品]全家拿鐵霜淇淋
## 10: [資訊]7-11制服確定版?
## 11: [商品]全家多重起司培根麵包
## 12: [商品]7-11川味麻辣湯底
## 13: [商品]全家香菇雞燉湯
## 14: [商品]全家台啤柚稚
## 15: [商品]全家橙汁鮮彩烤雞沙拉
## 16: [商品]全家杜老爺淇淋巧酥甜筒
## 17: [商品]7-11新感覺雞蛋沙拉
## 18: [商品]7-11雞肉紅藜蔬菜濃湯
## 19: [商品]全家伯爵巧克力餅乾泡芙
## 20: [商品]全家易口舒脆皮薄荷糖
Count articles per topic and board
posts_topic %>%
group_by(artCat,topic) %>%
summarise(sum = n()) %>%
ggplot(aes(x= artCat,y=sum,fill=as.factor(topic))) +
geom_col(position="dodge") ## `summarise()` has grouped output by 'artCat'. You can override using the `.groups` argument.
Plot the distribution of posts over time, for the discussion that follows
posts %>%
mutate(artDate = as.Date(artDate)) %>%
group_by(artDate) %>%
summarise(count = n())%>%
ggplot(aes(artDate,count))+
geom_line(color="royalblue")+
geom_point()->post_plot
ggplotly(post_plot)
(figure: result_for_hist, interactive daily post counts)
# Join articles with their comments
reviews <- reviews %>%
  select(artUrl, cmtPoster, cmtStatus, cmtContent)
posts_Reviews <- merge(x = posts, y = reviews, by = "artUrl")
# Attach each article's topic
posts_Reviews <- merge(x = posts_Reviews, y = ele_topics, by.x = "artUrl", by.y="document")
head(posts_Reviews,3)
## artUrl
## 1: https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
## 2: https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
## 3: https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
## artTitle artDate artTime artPoster artCat commentNum
## 1: [商品]全家狠飽印度咖哩炒飯 2015/12/31 20:09:27 peter0627 CVS 8
## 2: [商品]全家狠飽印度咖哩炒飯 2015/12/31 20:09:27 peter0627 CVS 8
## 3: [商品]全家狠飽印度咖哩炒飯 2015/12/31 20:09:27 peter0627 CVS 8
## push boo
## 1: 3 0
## 2: 3 0
## 3: 3 0
## sentence
## 1: : 狠飽印度咖哩炒飯 原價70 我有國軍福利卡九折63\nhttp://imgur.com/seJxWTj\n: 全家\n\n:\nhttp://imgur.com/GVmOqMj\n: 有九折60 沒九折50\n\n【心得】:\n\n早上9:30 點了份早餐 結果10點才有空吃\n\n然後又去買了杯飲料 到中午12點根本吃不下任何東西.....\n\n最後兩點半差不多餓了才去買吃的\n\n看到新商品(應該吧....\n\n想說這一份應該滿飽的\n\n平常買便利商店的微波燴飯通常還要再加一個三角飯糰才會飽\n\n要不然到晚餐之前又會想吃零食 = =\n\n開箱(?)圖\nhttp://imgur.com/VyWWOCT\n飯還滿多的 咖哩醬料超級少的.... WTF\n\n第一直覺感覺是那些醬料拌飯應該是不夠量的\n\n結果還真的 有些吃進嘴裡根本沒咖哩醬料......\n\n不過 飯上面好像有灑一些香料(? 辣椒粉(?\n\n不至於說沒沾到醬料就完全沒有味道\n\n如果你跟全家的店員很熟的話可以考慮淋點關東煮湯配飯 XDDDDDDD\n\n要不然感覺沒拌到醬料的飯就有點乾乾的..\n\n這一份吃完還滿飽的 只是感覺.....有點空虛\n\n不過能吃飽比較重要....\n\n
## 2: : 狠飽印度咖哩炒飯 原價70 我有國軍福利卡九折63\nhttp://imgur.com/seJxWTj\n: 全家\n\n:\nhttp://imgur.com/GVmOqMj\n: 有九折60 沒九折50\n\n【心得】:\n\n早上9:30 點了份早餐 結果10點才有空吃\n\n然後又去買了杯飲料 到中午12點根本吃不下任何東西.....\n\n最後兩點半差不多餓了才去買吃的\n\n看到新商品(應該吧....\n\n想說這一份應該滿飽的\n\n平常買便利商店的微波燴飯通常還要再加一個三角飯糰才會飽\n\n要不然到晚餐之前又會想吃零食 = =\n\n開箱(?)圖\nhttp://imgur.com/VyWWOCT\n飯還滿多的 咖哩醬料超級少的.... WTF\n\n第一直覺感覺是那些醬料拌飯應該是不夠量的\n\n結果還真的 有些吃進嘴裡根本沒咖哩醬料......\n\n不過 飯上面好像有灑一些香料(? 辣椒粉(?\n\n不至於說沒沾到醬料就完全沒有味道\n\n如果你跟全家的店員很熟的話可以考慮淋點關東煮湯配飯 XDDDDDDD\n\n要不然感覺沒拌到醬料的飯就有點乾乾的..\n\n這一份吃完還滿飽的 只是感覺.....有點空虛\n\n不過能吃飽比較重要....\n\n
## 3: : 狠飽印度咖哩炒飯 原價70 我有國軍福利卡九折63\nhttp://imgur.com/seJxWTj\n: 全家\n\n:\nhttp://imgur.com/GVmOqMj\n: 有九折60 沒九折50\n\n【心得】:\n\n早上9:30 點了份早餐 結果10點才有空吃\n\n然後又去買了杯飲料 到中午12點根本吃不下任何東西.....\n\n最後兩點半差不多餓了才去買吃的\n\n看到新商品(應該吧....\n\n想說這一份應該滿飽的\n\n平常買便利商店的微波燴飯通常還要再加一個三角飯糰才會飽\n\n要不然到晚餐之前又會想吃零食 = =\n\n開箱(?)圖\nhttp://imgur.com/VyWWOCT\n飯還滿多的 咖哩醬料超級少的.... WTF\n\n第一直覺感覺是那些醬料拌飯應該是不夠量的\n\n結果還真的 有些吃進嘴裡根本沒咖哩醬料......\n\n不過 飯上面好像有灑一些香料(? 辣椒粉(?\n\n不至於說沒沾到醬料就完全沒有味道\n\n如果你跟全家的店員很熟的話可以考慮淋點關東煮湯配飯 XDDDDDDD\n\n要不然感覺沒拌到醬料的飯就有點乾乾的..\n\n這一份吃完還滿飽的 只是感覺.....有點空虛\n\n不過能吃飽比較重要....\n\n
## cmtPoster cmtStatus cmtContent
## 1: helloonew401 → :覺得超乾……又乾又很多有點煩…
## 2: q2203649 推 :真的很不解,為什麼超商有醬料的料理醬都給超少,是很貴
## 3: q2203649 → :嗎?
## topic gamma
## 1: 1 0.5170661
## 2: 1 0.5170661
## 3: 1 0.5170661
Commenters and posters
link <- posts_Reviews %>% select(cmtPoster, artPoster, artUrl)
head(link,3)
## cmtPoster artPoster artUrl
## 1: helloonew401 peter0627 https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
## 2: q2203649 peter0627 https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
## 3: q2203649 peter0627 https://www.ptt.cc/bbs/CVS/M.1451621733.A.AA4.html
Filter the data
posts %>%
filter(commentNum<100) %>%
  ggplot(aes(x=commentNum)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most articles' comment counts cluster around 10.
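The same distribution can be summarized numerically; a quick check on the commentNum column loaded above:
summary(posts$commentNum)   # min / quartiles / mean / max comments per article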
# Number of articles per account
post_count = posts %>%
group_by(artPoster) %>%
summarise(count = n()) %>%
arrange(desc(count))
post_count
## # A tibble: 4,782 x 2
## artPoster count
## <chr> <int>
## 1 ykch 311
## 2 medama 89
## 3 lucicarr 80
## 4 nandan1202 73
## 5 B9560020 62
## 6 bleachwe 55
## 7 KoyaB 55
## 8 rinhh 51
## 9 Donnas 50
## 10 chunliang 47
## # ... with 4,772 more rows
# Number of comments per account
review_count = reviews %>%
group_by(cmtPoster) %>%
summarise(count = n()) %>%
arrange(desc(count))
review_count
## # A tibble: 30,330 x 2
## cmtPoster count
## <chr> <int>
## 1 kaojet 2355
## 2 w4 1830
## 3 leejee 1623
## 4 tengobo 1584
## 5 bailan 1077
## 6 nissy 1067
## 7 fcz973 974
## 8 thouloveme 924
## 9 wtfconk 857
## 10 wi22900 853
## # ... with 30,320 more rows
# Posters
poster_select <- post_count %>% filter(count >= 2)   # accounts with at least two posts
posts <- posts %>% filter(posts$artPoster %in% poster_select$artPoster)
# Commenters
reviewer_select <- review_count %>% filter(count >= 20)   # accounts with at least 20 comments
reviews <- reviews %>% filter(reviews$cmtPoster %in% reviewer_select$cmtPoster)
# Check the number of participants
length(unique(posts_Reviews$artPoster)) # number of posters: 4744
## [1] 4744
length(unique(posts_Reviews$cmtPoster)) # number of commenters: 30330
## [1] 30330
allPoster <- c(posts_Reviews$artPoster, posts_Reviews$cmtPoster) # all participants: 31394
length(unique(allPoster))
## [1] 31394
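Since 4,744 posters plus 30,330 commenters exceed the 31,394 unique participants, some accounts both post and comment; the overlap follows directly:
# Accounts that appear both as poster and commenter
length(intersect(unique(posts_Reviews$artPoster),
                 unique(posts_Reviews$cmtPoster)))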
userList <- data.frame(user=unique(allPoster)) %>%
  mutate(type=ifelse(user %in% posts$artPoster, "poster", "replyer"))
head(userList,3)
## user type
## 1 peter0627 poster
## 2 j035p replyer
## 3 KingTom replyer
Select 2019/1/9 as the representative day for the analysis
link <- posts_Reviews %>%
group_by(cmtPoster, artUrl) %>%
#filter(n()>1) %>%
filter(commentNum > 20) %>%
filter(artDate == as.Date('2019-01-09')) %>%
#filter(topic == 1| topic == 2) %>%
select(cmtPoster, artPoster, artCat, artUrl, topic) %>%
unique()
link
## # A tibble: 264 x 5
## # Groups: cmtPoster, artUrl [264]
## cmtPoster artPoster artCat artUrl topic
## <chr> <chr> <chr> <chr> <int>
## 1 asppsa brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 2 xsexyx brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 3 mindy201 brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 4 keiko198 brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 5 CRS339 brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 6 match123 brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 7 kikiwind brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 8 sleepsnow brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 9 Lynn4706 brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## 10 dfg22200q brunomarsfan CVS https://www.ptt.cc/bbs/CVS/M.1547033482.~ 2
## # ... with 254 more rows
filtered_user <- userList %>%
  filter(user %in% link$cmtPoster | user %in% link$artPoster) %>%
  arrange(desc(type))
head(filtered_user,5)
## user type
## 1 satan2619 replyer
## 2 chelsea1035 replyer
## 3 devilcats replyer
## 4 gjaej replyer
## 5 knnp replyer
filter_degree = 7
# Build the network
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# Color nodes by user role
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "orange", "lightblue")
# Color edges by the topic of the article the comment belongs to
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$topic == "2", "goldenrod2", "seagreen")
# Plot the network (label only nodes with degree > filter_degree)
set.seed(5432)
plot(reviewNetwork, vertex.size=5, edge.width=3, vertex.label.dist=1,
     vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA), vertex.label.font=2)
# Add legends
legend("bottomright", c("發文者","回文者"), pch=21,
       col="#777777", pt.bg=c("orange","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("優惠、服務、消息","商品討論"),
       col=c("goldenrod2", "seagreen"), lty=1, cex=1)
filter_degree = 5 # user degree threshold
# Filter by the commenters' push/boo behavior toward the poster
link <- posts_Reviews %>%
  filter(commentNum > 25) %>%
  filter(cmtStatus != "→") %>%   # drop neutral arrow comments
  group_by(cmtPoster, artUrl) %>%
  filter(n() > 5) %>%
  ungroup() %>%
  select(cmtPoster, artPoster, artUrl, cmtStatus) %>%
  unique()
# Keep only users that appear in link
filtered_user <- userList %>%
  filter(user %in% link$cmtPoster | user %in% link$artPoster) %>%
  arrange(desc(type))
# Build the network
reviewNetwork <- graph_from_data_frame(d=link, v=filtered_user, directed=F)
# Color nodes by user role
labels <- degree(reviewNetwork)
V(reviewNetwork)$label <- names(labels)
V(reviewNetwork)$color <- ifelse(V(reviewNetwork)$type=="poster", "gold", "lightblue")
# Color edges by push/boo status
E(reviewNetwork)$color <- ifelse(E(reviewNetwork)$cmtStatus == "推", "lightgreen", "palevioletred")
# Plot the network
set.seed(5432)
plot(reviewNetwork, vertex.size=8, edge.width=3, vertex.label.dist=1,
     vertex.label=ifelse(degree(reviewNetwork) > filter_degree, V(reviewNetwork)$label, NA), vertex.label.font=2)
# Add legends
legend("bottomright", c("發文者","回文者"), pch=21,
       col="#777777", pt.bg=c("gold","lightblue"), pt.cex=1, cex=1)
legend("topleft", c("推","噓"),
       col=c("lightgreen","palevioletred"), lty=1, cex=1)
On topic classification: judged by perplexity, 8 topics gives the lowest value and 2 topics the second lowest; but because the 8-topic solution's word lists overlap too heavily, we settled on the second-best choice of 2 topics: Topic 1 (promotions, services, news) and Topic 2 (product discussion).
We chose 2019/1/9, the day with the highest discussion volume, as the representative day; inspection shows that most opinion leaders that day were discussing lucky-bag (福袋) unboxings. Across the corpus as a whole, kaojet discussed FamilyMart's 戽斗星球 magnet figurines, B9560020 discussed 7-11 group-buy campaigns, and zzz41432 discussed the 7-11 Kitty & LINE co-branded cup and bottle set. Judging from what these opinion leaders discuss, co-branded campaigns are clearly popular with consumers.
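To surface such opinion leaders systematically rather than by inspection, the nodes of the network built above can be ranked by degree; a minimal sketch (reviewNetwork here is the last, push/boo network):
# Ten best-connected accounts in the comment network
head(sort(degree(reviewNetwork), decreasing = TRUE), 10)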