Packages loaded: dplyr, tidytext, jiebaR, jiebaRD, gutenbergr, stringr, wordcloud2, ggplot2, tidyr, scales, data.table.
Gutenberg free eBooks
Various Chinese books are also available there, including 三國志 (Records of the Three Kingdoms) by Chen Shou:
threekt <- gutenberg_download(25606, mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  filter(text != "") %>%
  distinct(gutenberg_id, text)
# Import 三國志 and use the 魏書 (Book of Wei) and 吳書 (Book of Wu) headings, such as 魏書一 and 吳書一, as chapter boundaries.
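The code that assigns the `chapter` column is not shown. A minimal sketch of one way to do it, assuming every chapter opens with a heading line that starts with 魏書 or 吳書; the `demo_text` sample and the regular expression are illustrative assumptions, not the author's code:

```r
# Hypothetical sample lines; the real input would be threekt$text.
demo_text <- c(
  "魏書一 武帝紀第一",
  " 太祖武皇帝,沛國譙人也",
  "魏書二 文帝紀第二",
  " 文皇帝諱丕,字子桓",
  "吳書一 孫破虜討逆傳第一",
  " 孫堅字文台,吳郡富春人"
)

# A line counts as a chapter heading when it starts with 魏書 or 吳書.
is_heading <- grepl("^(魏|吳)書", demo_text)

# The running total of headings seen so far is the chapter number of each line.
chapter <- cumsum(is_heading)

print(data.frame(chapter = chapter, text = demo_text))
```

Inside a dplyr pipeline the same idea could be written as `mutate(chapter = cumsum(str_detect(text, "^(魏|吳)書")))`.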
## # A tibble: 50 x 3
## gutenberg_id text chapter
## <int> <chr> <int>
## 1 25606 "魏書一 武帝紀第一" 1
## 2 25606 " 太祖武皇帝,沛國譙人也,姓曹,諱操,字孟德,漢相國參之後。〔曹瞞傳曰〕:太祖一名吉利,小字阿瞞。王沈魏書~ 1
## 3 25606 " 太祖少機警,有權數,而任俠放蕩,不治行業,故世人未之奇也;曹瞞傳雲:太祖少好飛鷹走狗,遊蕩無度,其叔父數~ 1
## 4 25606 " 光和末,黃巾起。拜騎都尉,討潁川賊。遷為濟南相,國有十餘縣,長吏多阿附貴戚,贓汙狼藉,於是奏免其八;禁斷~ 1
## 5 25606 " 頃之,冀州刺史王芬、南陽許攸、沛國周旌等連結豪傑,謀廢靈帝,立合肥侯,以告太祖,太祖拒之。芬等遂敗。司馬~ 1
## 6 25606 " 金城邊章、韓遂殺刺史郡守以叛,眾十餘萬,天下騷動。徵太祖為典軍校尉。會靈帝崩,太子即位,太后臨朝。大將軍~ 1
## 7 25606 " 初平元年春正月,後將軍袁術、冀州牧韓馥、英雄記曰:馥字文節,潁川人。為禦史中丞。董卓舉為冀州牧。于時冀州~ 1
## 8 25606 " 二月,卓聞兵起,乃徙天子都長安。卓留屯洛陽,遂焚宮室。是時紹屯河內,邈、岱、瑁、遺屯酸棗,術屯南陽,伷屯~ 1
## 9 25606 " 太祖到酸棗,諸軍兵十餘萬,日置酒高會,不圖進取。太祖責讓之,因為謀曰:「諸君聽吾計,使勃海引河內之眾臨孟~ 1
## 10 25606 " 太祖兵少,乃與夏侯惇等詣揚州募兵,刺史陳溫、丹楊太守周昕與兵四千餘人。還到龍亢,士卒多叛。魏書曰:兵謀叛~ 1
## # ... with 40 more rows
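# Define the word-segmentation function.
# jieba_tokenizer is assumed to be a jiebaR worker created earlier (for example with worker()); its definition is not shown.
threekt_tokenizer <- function(t) {
  lapply(t, function(x) {
    # Segment one string into a character vector of words
    segment(x, jieba_tokenizer)
  })
}
## tibble [202,888 x 3] (S3: tbl_df/tbl/data.frame)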
## $ gutenberg_id: int [1:202888] 25606 25606 25606 25606 25606 25606 25606 25606 25606 25606 ...
## $ chapter : int [1:202888] 1 1 1 1 1 1 1 1 1 1 ...
## $ word : chr [1:202888] "魏書" "一" " " " " ...
## # A tibble: 500 x 3
## gutenberg_id chapter word
## <int> <int> <chr>
## 1 25606 1 魏書
## 2 25606 1 一
## 3 25606 1
## 4 25606 1
## 5 25606 1 武帝紀
## 6 25606 1 第一
## 7 25606 1
## 8 25606 1
## 9 25606 1 太祖
## 10 25606 1 武
## # ... with 490 more rows
## # A tibble: 30 x 2
## word sum
## <chr> <int>
## 1 將軍 845
## 2 太祖 610
## 3 太守 416
## 4 天下 382
## 5 不能 287
## 6 陛下 204
## 7 不可 203
## 8 大將軍 201
## 9 天子 192
## 10 於是 180
## # ... with 20 more rows
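The counting step behind this table is not shown. One plausible sketch, assuming blank tokens and single-character tokens are dropped before counting; the `words` sample is made up for illustration:

```r
# Hypothetical token vector; the real input would be tokens_threekt$word.
words <- c("魏書", "一", " ", "太祖", "將軍", "太祖", "武", "將軍", "將軍", "太守")

# Keep only tokens that still have at least two characters after trimming spaces.
kept <- words[nchar(trimws(words)) > 1]

# Tabulate the remaining tokens and sort from most to least frequent.
freq <- sort(table(kept), decreasing = TRUE)

top_words <- data.frame(word = names(freq), sum = as.integer(freq))
print(top_words)
```

With dplyr, a roughly equivalent pipeline would be `filter(nchar(word) > 1) %>% count(word, name = "sum", sort = TRUE)`.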
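Here we can see that 將軍 (general), 太祖 (Taizu), and 太守 (governor) are the most frequent words. Because 三國志 usually refers to people by their military rank, 將軍 appears especially often.
三國志 is written mainly from the Cao Wei point of view. In it, 太祖 refers to 太祖武皇帝, that is, Cao Cao (曹操), the central figure of Cao Wei, which explains the high frequency of the word.
太守, roughly equivalent to a present-day local official, ranks third. Presumably these officials are mentioned often when military campaigns are being planned.
Next, we take the three principal leaders of the Three Kingdoms, Cao Cao (曹操, Wei), Liu Bei (劉備, Shu), and Sun Quan (孫權, Wu), and count how often each appears in every chapter.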
# Count per-chapter mentions of Cao Cao under any of his names
tsao_tokens <- tokens_threekt %>%
  filter(word %in% c("太祖", "曹操", "孟德")) %>%
  group_by(chapter) %>%
  summarise(count = n()) %>%
  mutate(word = "曹操")
# Same for Liu Bei
liu_tokens <- tokens_threekt %>%
  filter(word %in% c("劉備", "玄德", "先主")) %>%
  group_by(chapter) %>%
  summarise(count = n()) %>%
  mutate(word = "劉備")
# Same for Sun Quan
sun_tokens <- tokens_threekt %>%
  filter(word %in% c("孫權", "仲謀")) %>%
  group_by(chapter) %>%
  summarise(count = n()) %>%
  mutate(word = "孫權")
# Stack the three tables and draw one bar panel per leader
bind_rows(tsao_tokens, liu_tokens, sun_tokens) %>%
  ggplot(aes(x = chapter, y = count, fill = word)) +
  geom_col(width = 0.5, show.legend = FALSE) +
  facet_wrap(~word, ncol = 1) +
  ggtitle("Appearances of the main figures of 三國志 by chapter") +
  xlab("Chapter") +
  ylab("Appearances") +
  scale_x_continuous(breaks = seq(0, 20, by = 5))
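The figure above shows that Cao Cao appears more often than the other two leaders. Sun Quan appears mostly in the final chapter, while Liu Bei's appearances are split between the first and last chapters. This is consistent with 三國志 being written from the Cao Wei perspective.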
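The per-leader counts differ only in the list of names matched, so they could also be produced from a single alias table. A sketch in base R, assuming a token table shaped like `tokens_threekt`; the `demo_tokens` data and the `lookup` and `leader_counts` names are hypothetical:

```r
# Hypothetical tokens; the real input would be tokens_threekt.
demo_tokens <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 20, 20),
  word    = c("太祖", "曹操", "劉備", "先主", "太祖", "孫權", "仲謀"),
  stringsAsFactors = FALSE
)

# Each leader maps to the names that should be counted as that leader.
aliases <- list(
  "曹操" = c("太祖", "曹操", "孟德"),
  "劉備" = c("劉備", "玄德", "先主"),
  "孫權" = c("孫權", "仲謀")
)

# Flatten the list into a two-column lookup table: alias -> leader.
lookup <- stack(aliases)
names(lookup) <- c("word", "leader")
lookup$leader <- as.character(lookup$leader)

# Keep only alias tokens, then sum one unit per token by chapter and leader.
matched <- merge(demo_tokens, lookup, by = "word")
matched$count <- 1L
leader_counts <- aggregate(count ~ chapter + leader, data = matched, FUN = sum)
print(leader_counts)
```

Adding another figure then only requires a new entry in `aliases`, rather than a fourth copy of the filter block.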
# Combine per-chapter sentence counts and word counts into one long table
bind_rows(
  threekt %>%
    group_by(chapter) %>%
    summarise(count = n(), type = "sentences"),
  tokens_threekt %>%
    group_by(chapter) %>%
    summarise(count = n(), type = "words")
) %>%
  ggplot(aes(x = chapter, y = count, color = type)) +
  geom_line() +
  ggtitle("Sentence and word counts per chapter") +
  xlab("Chapter") +
  ylab("Count")
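Chapter 20 clearly contains a very large number of words, but because the word counts are so large, the sentence counts are hard to read on the same scale.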
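One way to compare both series on a single axis is to rescale each one by its own maximum. A small sketch with made-up numbers; the `counts` table and the `relative` column are illustrative and not part of the original analysis:

```r
# Hypothetical per-chapter counts; real values would come from the two summaries.
counts <- data.frame(
  chapter = rep(1:3, times = 2),
  type    = rep(c("sentences", "words"), each = 3),
  count   = c(50, 40, 80, 10000, 8000, 20000)
)

# Divide each series by its own maximum so both end up between 0 and 1.
counts$relative <- counts$count / ave(counts$count, counts$type, FUN = max)
print(counts)
```

In ggplot2, `facet_wrap(~type, scales = "free_y")` is another option: it gives each series its own y axis while keeping the raw counts.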
# Plot the sentence counts on their own so the scale is readable
threekt %>%
  group_by(chapter) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = chapter, y = count)) +
  geom_line(colour = "red") +
  ggtitle("Sentence counts per chapter") +
  xlab("Chapter") +
  ylab("Number of sentences")
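The figure above shows that chapter 20 also has the most sentences. On inspection, chapter 20 is 吳書十 程黃韓蔣周陳董甘淩徐潘丁傳第十, which comprises the biographies of 程普, 黃蓋, 韓當, 蔣欽, 周泰, 陳武, 董襲, 甘寧, 凌統, 徐盛, 潘璋, and 丁奉. These were generals of Eastern Wu (孫吳) during the Three Kingdoms period, and Chen Shou praised them as the 「江東十二虎臣」 (the Twelve Tiger Generals of Jiangdong).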