1 Data Overview
- Data source: The Simpsons subtitle archive
- Season 31, Episodes 12-16 (5 episodes in total)
- Subtitles are available in both Chinese and English (this analysis uses the Traditional Chinese version)
1.1 Word Segmentation System
- Jieba (結巴)
1.2 Sentiment Dictionary
- Chinese: LIWC (used in this analysis)
2 Text Exploration
2.1 System Locale Settings
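The chunk itself is not shown; a minimal sketch that would print the locale string below, assuming a macOS or Linux system with the zh_TW.UTF-8 locale installed:
Sys.setlocale(category = "LC_ALL", locale = "zh_TW.UTF-8") # switch to the Traditional Chinese locale
Sys.getlocale()                                            # confirm the active locale string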
## [1] "zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/en_US.UTF-8"
2.2 Reading the Data
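The loading chunk is not shown; a hedged sketch, assuming one Traditional Chinese subtitle text file per episode under a hypothetical ./data/ folder:
library(readr)
files = sprintf("./data/s31e%02d.txt", 12:16) # hypothetical file names for episodes 12-16
raw = sapply(files, read_file)                # one long string per episode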
2.3 Text Preprocessing
2.3.1 Initializing the Segmentation Engine
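The initialization chunk is not shown; a plausible sketch, assuming jiebaR with its bundled default dictionaries (the worker name jieba_tokenizer matches the calls in later chunks):
library(jiebaR)
jieba_tokenizer = worker(type = "mix", symbol = FALSE) # mixed-model segmenter; drop punctuation symbols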
2.3.2 Inspecting the Segmentation Results
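The call behind the output is not shown; a plausible reconstruction, assuming `tokens` holds the raw subtitle text of episode 12 (the same object later passed to filter_segment in 2.3.3):
segment(tokens, jieba_tokenizer) %>% head(100) # first 100 tokens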
## [1] "第十二集" "晚上好" "霍雷肖" "剩下的" "給我" "田鎮"
## [7] "尋寶之地" "說" "尋寶之地" "這周" "找到" "沉船"
## [13] "四十年" "深度" "米" "聲吶" "探測" "無果"
## [19] "繼續" "下" "一個" "讀數" "深度" "米"
## [25] "聲吶" "探測" "無果" "船長" "長" "范圍"
## [31] "多普勒" "雷達" "顯示" "暴風" "正" "東北方"
## [37] "過來" "最好" "掉頭" "回去" "風暴" "過去"
## [43] "猜猜" "尋到" "寶" "醫生" "說" "增加"
## [49] "藥量" "事" "風暴" "攪亂" "海底" "從斷"
## [55] "脊" "南面" "重新" "探測" "坐標" "深度"
## [61] "米" "聲吶" "探測" "無果" "做" "烤肉"
## [67] "晚飯" "丟掉" "抱歉" "一直" "沒能給" "一兒"
## [73] "半女" "深度" "米" "聲吶" "探測" "第二聲"
## [79] "回響" "第二聲" "回響" "大半輩子" "說" "這句"
## [85] "潛水員" "裝備" "找到" "終于" "找到" "伙計們"
## [91] "浪費" "一生" "市長" "驚" "嘆" "戰利品"
## [97] "戰利品" "屬于" "小鎮" "政府"
2.3.3 Dynamically Adding Custom Words / Stop Words
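Comparing the outputs before and after, 「下一個」 and 「驚嘆」 are now kept as single tokens; a minimal sketch of the dictionary update, assuming jiebaR's new_user_word (which returns TRUE on success, matching the first output below; the token list that follows comes from re-running the call in 2.3.2):
new_user_word(jieba_tokenizer, c("下一個", "驚嘆")) # register words the default dictionary splits apart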
## [1] TRUE
## [1] "第十二集" "晚上好" "霍雷肖" "剩下的" "給我" "田鎮"
## [7] "尋寶之地" "說" "尋寶之地" "這周" "找到" "沉船"
## [13] "四十年" "深度" "米" "聲吶" "探測" "無果"
## [19] "繼續" "下一個" "讀數" "深度" "米" "聲吶"
## [25] "探測" "無果" "船長" "長" "范圍" "多普勒"
## [31] "雷達" "顯示" "暴風" "正" "東北方" "過來"
## [37] "最好" "掉頭" "回去" "風暴" "過去" "猜猜"
## [43] "尋到" "寶" "醫生" "說" "增加" "藥量"
## [49] "事" "風暴" "攪亂" "海底" "從斷" "脊"
## [55] "南面" "重新" "探測" "坐標" "深度" "米"
## [61] "聲吶" "探測" "無果" "做" "烤肉" "晚飯"
## [67] "丟掉" "抱歉" "一直" "沒能給" "一兒" "半女"
## [73] "深度" "米" "聲吶" "探測" "第二聲" "回響"
## [79] "第二聲" "回響" "大半輩子" "說" "這句" "潛水員"
## [85] "裝備" "找到" "終于" "找到" "伙計們" "浪費"
## [91] "一生" "市長" "驚嘆" "戰利品" "戰利品" "屬于"
## [97] "小鎮" "政府" "管轄" "范圍"
simps = filter_segment(tokens, c("晚上好", "這周", "正", "成", "一個", "再")) # filter out stop words
segment(simps, jieba_tokenizer) %>% head(100)
## [1] "第十二集" "霍雷肖" "剩下的" "給我" "田鎮" "尋寶之地"
## [7] "說" "尋寶之地" "找到" "沉船" "四十年" "深度"
## [13] "米" "聲吶" "探測" "無果" "繼續" "下一個"
## [19] "讀數" "深度" "米" "聲吶" "探測" "無果"
## [25] "船長" "長" "范圍" "多普勒" "雷達" "顯示"
## [31] "暴風" "東北方" "過來" "最好" "掉頭" "回去"
## [37] "風暴" "過去" "猜猜" "尋到" "寶" "醫生"
## [43] "說" "增加" "藥量" "事" "風暴" "攪亂"
## [49] "海底" "從斷" "脊" "南面" "重新" "探測"
## [55] "坐標" "深度" "米" "聲吶" "探測" "無果"
## [61] "做" "烤肉" "晚飯" "丟掉" "抱歉" "一直"
## [67] "沒能給" "一兒" "半女" "深度" "米" "聲吶"
## [73] "探測" "第二聲" "回響" "第二聲" "回響" "大半輩子"
## [79] "說" "這句" "潛水員" "裝備" "找到" "終于"
## [85] "找到" "伙計們" "浪費" "一生" "市長" "驚嘆"
## [91] "戰利品" "戰利品" "屬于" "小鎮" "政府" "管轄"
## [97] "范圍" "發現" "非" "建制"
2.3.4 Tagging Episode Numbers
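The tagging chunk is not shown; a hedged reconstruction, assuming `raw` holds one string per episode (as sketched in 2.2) and reusing the stop words filtered in 2.3.3, builds the `simps_ep` tibble consumed by every later chunk:
library(dplyr)
library(purrr)
stop_words = c("晚上好", "這周", "正", "成", "一個", "再") # hypothetical consolidated stop-word list
simps_ep = map2_dfr(raw, 12:16, function(text, ep) {
  tibble(word = segment(filter_segment(text, stop_words), jieba_tokenizer), # tokenize each episode
         episode = ep)                                                      # tag with its episode number
})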
2.3.5 Word Cloud Visualization
tokens_count = simps_ep %>%
group_by(word) %>%
summarise(count = n()) %>%
filter(count > 10) %>% # keep only words appearing more than 10 times
arrange(desc(count))
head(tokens_count, 10)
## # A tibble: 10 x 2
## word count
## <chr> <int>
## 1 巴仔 31
## 2 學校 26
## 3 劇透 24
## 4 喜歡 24
## 5 時間 23
## 6 電影 20
## 7 工作 20
## 8 貨幣 19
## 9 教授 18
## 10 臨客 18
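With the frequency table ready, drawing the cloud is one call; a minimal sketch using the wordcloud2 package (the original chunk and font choice are not shown, so treat both as assumptions):
library(wordcloud2)
wordcloud2(tokens_count, fontFamily = "蘋方-繁 中黑體") # expects word/frequency columns, which tokens_count provides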
3 Sentiment Analysis
3.1 Preparing the Dictionary
# read the comma-separated LIWC word lists, then label each word with its polarity
p = read_file("./liwc/positive.txt")
n = read_file("./liwc/negative.txt")
positive = data.frame(word = strsplit(p, "[,]")[[1]], sentiment = "positive", stringsAsFactors = FALSE)
negative = data.frame(word = strsplit(n, "[,]")[[1]], sentiment = "negative", stringsAsFactors = FALSE)
LIWC_ch = rbind(positive, negative) # stack both polarities into one lookup table
LIWC_ch %>% head(10)
## word sentiment
## 1 一流 positive
## 2 下定決心 positive
## 3 不拘小節 positive
## 4 不費力 positive
## 5 不錯 positive
## 6 主動 positive
## 7 乾杯 positive
## 8 乾淨 positive
## 9 了不起 positive
## 10 享受 positive
3.2 Judging the Sentiment of Simpsons Lines with the LIWC Dictionary
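The chunk behind the output below is not shown; a plausible reconstruction that joins every token against the dictionary and tallies the two polarities (the `Joining, by = "word"` message comes from inner_join guessing the key column):
simps_ep %>%
  inner_join(LIWC_ch) %>% # implicit join on the shared "word" column
  group_by(sentiment) %>%
  summarise(cnt = n())    # one row per token kept by the join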
## Joining, by = "word"
## # A tibble: 2 x 2
## sentiment cnt
## <chr> <int>
## 1 negative 222
## 2 positive 316
3.3 Analyzing Sentiment per Episode
plot_table = simps_ep %>%
group_by(episode, word) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
inner_join(LIWC_ch) %>% # attach sentiment labels; implicit join on "word"
group_by(episode, sentiment) %>%
summarise(cnt = sum(count))
## Joining, by = "word"
plot_table %>% ggplot() +
geom_line(aes(x=episode, y=cnt, colour=sentiment)) +
ggtitle("每集情緒差異") +
theme(text=element_text(family="蘋方-繁 中黑體", size=12),
plot.title=element_text(hjust = 0.5))
- In every episode, positive words outnumber negative ones, which differs somewhat from the usual perception of the Simpson family. My explanations:
- Although the Simpsons comes across as satirical, it is still a comedy; judged purely by word meaning, its vocabulary is easily classified as positive.
- Dictionary definitions matter: in a dictionary built for the horror film/game industry, words like 「恐怖」 (horrifying) and 「嚇人」 (scary) would not count as negative, but as positive words that fit the genre's needs.
- Episode 13 shows the largest gap between positive and negative, so it may be the most cheerful of the five episodes.
- Episode 14 shows the smallest gap, making it a comparatively neutral episode.
3.4 Analyzing Positive and Negative Words in Episode 13
simps_ep %>%
group_by(episode, word) %>%
summarise(count = n()) %>%
filter(episode==13) %>%
inner_join(LIWC_ch) %>%
group_by(word, sentiment) %>%
summarise(count=sum(count)) %>%
filter(count > 1) %>% # words appearing only once are too numerous, so drop them
acast(word ~ sentiment, value.var = "count", fill = 0) %>% # word-by-sentiment matrix for comparison.cloud
comparison.cloud(colors = c("gray80", "gray20"),
max.words = 100, family = "蘋方-繁 中黑體")
## Joining, by = "word"