Social Media Analytics: Midterm Report

Topic

Ni Kuang (倪匡) novel: "偷天換日"

Motivation and analysis goals

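Using the text-analysis techniques learned in this course, we analyze a short novel of the kind Ni Kuang excels at, to see whether the analysis can reflect the story's plot and structure. The protagonist, 賽觀音, is the widow of a very senior figure in "the organization" (the Chinese Communist Party). At 97 and close to death, she wants to entrust someone with a great secret she has concealed all her life. During the War of Resistance, the organization wanted to make sure every generation would have elite talent, so it had already set up a training program for designated second- and third-generation successors. While on the run, it secretly gathered and fostered the children of senior cadres. The plan was to send them abroad for advanced study, rotate them through posts at every level at home once they returned, and finally make them candidates for succession according to their performance. 賽觀音 was in charge of the group aged 2 to 3, 203 children in all, hidden in a remote mountain area. Then, in the middle of one night, a sudden mudslide swept away the old temple that housed the female soldiers and the children, leaving no trace, and 賽觀音 was the only survivor. Just as she was about to hang herself in atonement, she was rescued by the bandit chief 軍師娘子…

1. Why does 賽觀音 survive in the end, and why is there no record in the organization's documents?

2. What is 賽觀音's great secret about?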

1. Data acquisition and package loading

1.1 Install and load packages

# Install any packages that are not yet installed, then load them all
packages <- c("dplyr", "tidytext", "jiebaR", "stringr", "wordcloud", "wordcloud2",
              "ggplot2", "tidyr", "scales", "curl", "readr", "NLP", "ggraph",
              "igraph", "reshape2", "widyr")
existing <- installed.packages()[, "Package"]
for (pkg in setdiff(packages, existing)) install.packages(pkg)

# library() stops with an error if a package is missing, unlike require()
invisible(lapply(packages, library, character.only = TRUE))

1.2 Load the text file of "偷天換日"

# Read the novel, one line of the file per row (column V1)
orginalDF <- read.table("steal.txt", header = FALSE, sep = "\n",
                        fileEncoding = "UTF-8", stringsAsFactors = FALSE)

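2. Sentence splitting and word segmentation

2.1 Sentence splitting

# Split each line into clauses at Chinese punctuation marks
steal_vector <- unlist(strsplit(orginalDF$V1, "[，。？！]"), use.names = FALSE)
# Number the clauses; a heading that matches "第…部：" starts a new chapter
chapter_steal <- tibble(text = steal_vector) %>%
  mutate(line = row_number()) %>%
  mutate(chapter = cumsum(str_detect(text, regex("第.*部："))))
chapter_steal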

## # A tibble: 6,906 x 3
##    text                        line chapter
##    <chr>                      <int>   <int>
##  1 ︻第一部：美女︼               1       1
##  2 那天晚上                       2       1
##  3 和白素在外面忙了一天回家       3       1
##  4 車子停在門口                   4       1
##  5 白素先進屋子                   5       1
##  6 我將車子停好一些               6       1
##  7 就聽到屋子裡傳來紅綾的叫聲     7       1
##  8 紅綾一面叫                     8       1
##  9 一面還在說些甚麼               9       1
## 10 可是在因為聲響太吵耳          10       1
## # ... with 6,896 more rows
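
The chapter column relies on a running total: every clause that looks like a chapter heading adds one, and all following clauses inherit that number. A minimal base-R sketch of the idea, assuming a handful of toy clauses (only the heading pattern is taken from the report):

```r
# Toy clauses for illustration; the real input is steal_vector
clauses <- c("︻第一部：美女︼", "那天晚上", "車子停在門口",
             "︻第二部：麻木︼", "白素先進屋子")

# TRUE where a clause matches the chapter-heading pattern "第…部："
is_heading <- grepl("第.*部：", clauses)

# The running count of headings seen so far is the chapter number
chapter <- cumsum(is_heading)

data.frame(line = seq_along(clauses), chapter = chapter, text = clauses)
```

Note that the pattern is not anchored, so a body clause that happens to contain "第…部：" would also start a new chapter. Whether that ever occurs in steal.txt is not known from the report.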

2.2 Word segmentation

# Create the jieba worker, loading the stop-word list and the user dictionary
# Both lists were tuned over several passes (e.g. 於是)

jieba_tokenizer <- worker(stop_word = "stop_words.txt", user = "user_dict.txt")

# Tokenizer function passed to unnest_tokens()
book_tokenizer <- function(t) {
  lapply(t, function(x) {
    tokens <- segment(x, jieba_tokenizer)
    # Drop single-character tokens
    tokens[nchar(tokens) > 1]
  })
}
tidybook <- chapter_steal %>% unnest_tokens(word, text, token = book_tokenizer)

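2.3 Words that occur more than 50 times

These are mainly character names and other proper names.

tidybook %>%
  count(word, sort = TRUE) %>%
  filter(n > 50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  ylab("Occurrences") +
  coord_flip()

Why is [一口氣] ("a breath", as in sighing or drawing a breath) so common? => It is a hallmark of Ni Kuang's novels, which like to describe characters' reactions in an exaggerated way.

grep(".*一口氣$", steal_vector, value = TRUE) %>% head(50)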

##  [1] "我吸了一口氣"                            
##  [2] "白素咦了一口氣"                          
##  [3] "嘆了一口氣"                              
##  [4] "客人嘆了一口氣"                          
##  [5] "吸了一口氣"                              
##  [6] "我深深地吸了一口氣"                      
##  [7] "於是深深地吸了一口氣"                    
##  [8] "她就嘆了一口氣"                          
##  [9] "吸了一口氣"                              
## [10] "他吸了一口氣"                            
## [11] "大大地鬆了一口氣"                        
## [12] "雖然當時我只是吸了一口氣"                
## [13] "賽觀音緩緩地吸了一口氣"                  
## [14] "嘆了一口氣"                              
## [15] "於是就深深地吸了一口氣"                  
## [16] "我不由自主嘆了一口氣"                    
## [17] "賽觀音深深地吸了一口氣"                  
## [18] "白素則陡然吸了一口氣"                    
## [19] "她深深地吸了一口氣"                      
## [20] "賽觀音輕輕嘆了一口氣"                    
## [21] "她深深地吸了一口氣"                      
## [22] "長長地嘆了一口氣"                        
## [23] "我吸了一口氣"                            
## [24] "賽觀音才長長吁了一口氣"                  
## [25] "吸了一口氣"                              
## [26] "她輕輕嘆了一口氣"                        
## [27] "賽觀音長長地吸了一口氣"                  
## [28] "人人都鬆了一口氣"                        
## [29] "軍師娘子吸了一口氣"                      
## [30] "軍師娘子緩緩吸了一口氣"                  
## [31] "她吸了一口氣"                            
## [32] "於是吸了一口氣"                          
## [33] "在這時候我和白素不約而同深深地吸了一口氣"
## [34] "這倒令我鬆了一口氣"                      
## [35] "我當時輕輕嘆了一口氣"                    
## [36] "急速地吸了一口氣"                        
## [37] "白素嘆了一口氣"                          
## [38] "我深深吸了一口氣"

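2.4 Words that occur 30 to 50 times

=> This band holds the novel's core keywords. They show that the story recounts past history involving bandits (土匪) and troops (部隊).

tidybook %>%
  count(word, sort = TRUE) %>%
  filter(n >= 30 & n <= 50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  ylab("Occurrences") +
  coord_flip()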

3. Word cloud

# Count each word and keep those that occur more than 10 times
tokens_count <- tidybook %>%
  filter(nchar(word) > 1) %>%
  group_by(word) %>%
  summarise(sum = n()) %>%
  filter(sum > 10) %>%
  arrange(desc(sum))
tokens_count %>% wordcloud2()

4. Sentiment analysis (LIWC)

4.1 Custom sentiment dictionary

Files: steal_positive.txt, steal_negative.txt

# Each file holds one comma-separated list of words
p <- read_file(file.path(getwd(), "steal_positive.txt"))
n <- read_file(file.path(getwd(), "steal_negative.txt"))
positive <- strsplit(p, "[,]")[[1]]
negative <- strsplit(n, "[,]")[[1]]
positive <- data.frame(word = positive, sentiment = "positive")
negative <- data.frame(word = negative, sentiment = "negative")
LIWC_ch <- rbind(positive, negative)
head(LIWC_ch, 30)

##        word sentiment
## 1      一流  positive
## 2  下定決心  positive
## 3  不拘小節  positive
## 4    不費力  positive
## 5      不錯  positive
## 6      主動  positive
## 7      乾杯  positive
## 8      乾淨  positive
## 9    了不起  positive
## 10     享受  positive
## 11     仁心  positive
## 12     仁愛  positive
## 13     仁慈  positive
## 14     仁義  positive
## 15     仁術  positive
## 16     仔細  positive
## 17     付出  positive
## 18     伴侶  positive
## 19     伶俐  positive
## 20     作品  positive
## 21     依戀  positive
## 22     俊美  positive
## 23     俐落  positive
## 24     保證  positive
## 25     保護  positive
## 26     信任  positive
## 27     信奉  positive
## 28     信實  positive
## 29     信心  positive
## 30     信服  positive

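4.2 Sentiment difference per chapter

Each chapter's sentiment score is computed with the LIWC-style lexicon (LIWC_ch): the number of positive words minus the number of negative words.

calsentiment <- tidybook %>%
  inner_join(LIWC_ch, by = "word") %>%
  count(chapter, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentimentx = positive - negative)

head(calsentiment)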

## # A tibble: 6 x 4
##   chapter negative positive sentimentx
##     <int>    <dbl>    <dbl>      <dbl>
## 1       1       46      165        119
## 2       2       96       74        -22
## 3       3       54      109         55
## 4       4       72       49        -23
## 5       5      104       63        -41
## 6       6       72       59        -13
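
The score in the sentimentx column is simply the count of positive matches minus the count of negative matches within a chapter. The same calculation can be sketched in base R, assuming a hypothetical six-token sample and a four-word lexicon (the polarities follow the dictionary output in section 4.4; the chapter assignments are invented):

```r
# Hypothetical tokens and lexicon, for illustration only
tokens <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 2),
  word    = c("相信", "漂亮", "可怕", "問題", "可怕", "相信")
)
lexicon <- data.frame(
  word      = c("相信", "漂亮", "可怕", "問題"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Keep only tokens found in the lexicon, then cross-tabulate by chapter
matched <- merge(tokens, lexicon, by = "word")
counts  <- table(matched$chapter, matched$sentiment)

# Sentiment difference per chapter: positive minus negative
sentiment_diff <- counts[, "positive"] - counts[, "negative"]
sentiment_diff
```

Because this is a raw difference, longer chapters can produce larger absolute scores. Dividing by the number of matched words would be one way to compare chapters of different length, though the report does not do so.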

ggplot(calsentiment, aes(chapter, sentimentx, fill = chapter)) +
  geom_col(show.legend = FALSE, width = 0.8) +
  scale_x_continuous(name = "Chapter", breaks = 1:10, minor_breaks = NULL) +
  ylab("Sentiment difference") +
  theme_grey(base_family = "STKaiti")

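4.3 Chapter titles compared with chapter sentiment

Chapter by chapter, the story leans negative: four of the first six chapters have a negative sentiment difference. This fits the themes of lies and selfishness that Ni Kuang conveys in this story.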

gsub("︼",")",gsub("︻","(",grep(".*︼$", steal_vector,value=T) ))

##  [1] "(第一部：美女)" "(第二部：麻木)" "(第三部：重逢)" "(第四部：歷史)"
##  [5] "(第五部：烙印)" "(第六部：男女)" "(第七部：計劃)" "(第八部：巨災)"
##  [9] "(第九部：深究)" "(第十部：證據)"

#grep(".*歷史.*", steal_vector,value=T)
#grep(".*土匪.*", steal_vector,value=T)
#grep(".*部隊.*", steal_vector,value=T)
#grep(".*問題.*", steal_vector,value=T)
#grep(".*敘述.*", steal_vector,value=T)

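4.4 Sentiment words used in the novel

Starting from the words counted in tokens_count, an inner_join with the sentiment dictionary lists the sentiment words that the book actually uses.

result_sentiment <- tokens_count %>%
  select(word) %>%
  inner_join(LIWC_ch, by = "word")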

result_sentiment

## # A tibble: 31 x 2
##    word  sentiment
##    <chr> <chr>    
##  1 問題  negative 
##  2 相信  positive 
##  3 搖頭  negative 
##  4 重要  positive 
##  5 漂亮  positive 
##  6 漂亮  positive 
##  7 可怕  negative 
##  8 肯定  positive 
##  9 決定  positive 
## 10 美麗  positive 
## # ... with 21 more rows
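
The word 漂亮 appears twice in the output above, which suggests that the dictionary file lists it more than once. Each duplicated entry makes a join return an extra row per match, so such words would be counted twice in the chapter scores. A minimal sketch of the effect and of one possible remedy, assuming a hypothetical three-entry lexicon:

```r
# Hypothetical lexicon with one duplicated entry
lexicon <- data.frame(
  word      = c("漂亮", "漂亮", "可怕"),
  sentiment = c("positive", "positive", "negative")
)
tokens <- data.frame(word = c("漂亮", "可怕"))

# Joining against the raw lexicon returns 漂亮 twice
n_raw <- nrow(merge(tokens, lexicon, by = "word"))
n_raw    # 3

# Dropping duplicated rows first gives one row per matched token
lexicon_unique <- unique(lexicon)
n_dedup <- nrow(merge(tokens, lexicon_unique, by = "word"))
n_dedup  # 2
```

In the report's pipeline the analogous step would be to apply dplyr's distinct() to LIWC_ch before the joins. Doing so would change the counts shown in this section, so the outputs here reflect the lexicon as it was.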

4.5 Comparison cloud of sentiment keywords

Compare the positive and negative sentiment keywords found in the book.

par(family = "STKaiti")  # font that can render Chinese characters
tokens_count %>%
  inner_join(LIWC_ch, by = "word") %>%
  select(word, sentiment, sum) %>%
  acast(word ~ sentiment, value.var = "sum", fill = 0) %>%
  wordcloud::comparison.cloud(random.order = FALSE, colors = c("indianred3", "blue"), max.words = 108)

5. TF-IDF

# Tidy the segmentation result
# Count how often each word occurs in each chapter
steal_words <- tidybook %>%
  count(chapter, word, sort = TRUE)
steal_words

## # A tibble: 8,850 x 3
##    chapter word       n
##      <int> <chr>  <int>
##  1       6 賽觀音   109
##  2       6 於放      94
##  3       8 賽觀音    90
##  4       7 賽觀音    86
##  5       4 賽觀音    83
##  6       3 葫蘆生    74
##  7       1 紅綾      72
##  8       7 於放      71
##  9       9 於是      62
## 10       9 白素      58
## # ... with 8,840 more rows

# Total number of words in each chapter
total_words <- tidybook %>% 
  group_by(chapter) %>%
  summarise(sum = n())
total_words

## # A tibble: 10 x 2
##    chapter   sum
##      <int> <int>
##  1       1  1531
##  2       2  1510
##  3       3  1586
##  4       4  1501
##  5       5  1679
##  6       6  1669
##  7       7  1785
##  8       8  1597
##  9       9  1462
## 10      10  1373

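# Join steal_words (occurrences of each word in each chapter)
# with total_words (number of words in each chapter),
# adding each chapter's word total as the column sum
steal_words <- left_join(steal_words, total_words, by = "chapter")
steal_words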

## # A tibble: 8,850 x 4
##    chapter word       n   sum
##      <int> <chr>  <int> <int>
##  1       6 賽觀音   109  1669
##  2       6 於放      94  1669
##  3       8 賽觀音    90  1597
##  4       7 賽觀音    86  1785
##  5       4 賽觀音    83  1501
##  6       3 葫蘆生    74  1586
##  7       1 紅綾      72  1531
##  8       7 於放      71  1785
##  9       9 於是      62  1462
## 10       9 白素      58  1462
## # ... with 8,840 more rows

5.1 Compute TF-IDF

Treating each chapter as a document, compute the tf-idf of every word. => Question: why does 賽觀音 not appear?

steal_words_tf_idf <- steal_words %>%
  bind_tf_idf(word, chapter, n) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(chapter) %>%
  top_n(7, tf_idf) %>%
  ungroup() %>%
  arrange(desc(tf_idf))
steal_words_tf_idf

## # A tibble: 74 x 7
##    chapter word         n   sum      tf   idf tf_idf
##      <int> <fct>    <int> <int>   <dbl> <dbl>  <dbl>
##  1       1 紅綾        72  1531 0.0470  0.916 0.0431
##  2       8 軍師娘子    37  1597 0.0232  1.20  0.0279
##  3       3 葫蘆生      74  1586 0.0467  0.511 0.0238
##  4      10 文件        14  1373 0.0102  2.30  0.0235
##  5       7 日軍        17  1785 0.00952 2.30  0.0219
##  6       3 藍絲        22  1586 0.0139  1.20  0.0167
##  7      10 紅綾        24  1373 0.0175  0.916 0.0160
##  8       9 黃蟬        10  1462 0.00684 2.30  0.0157
##  9       1 介紹信      10  1531 0.00653 2.30  0.0150
## 10       3 患者        14  1586 0.00883 1.61  0.0142
## # ... with 64 more rows
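
The question about 賽觀音 follows from the idf term. bind_tf_idf() uses idf = ln(number of documents / number of documents containing the word), so a word that occurs in all ten chapters gets idf = ln(10/10) = 0 and drops out of every top list however frequent it is. A worked sketch using figures from the tables above. The chapter counts (4 for 紅綾, 10 for 賽觀音) are inferred from the idf column and the co-occurrence counts, not read directly from the data:

```r
n_chapters <- 10

# tf-idf with the natural-log idf used by tidytext::bind_tf_idf()
tf_idf <- function(n, chapter_total, chapters_with_word) {
  tf  <- n / chapter_total
  idf <- log(n_chapters / chapters_with_word)
  tf * idf
}

# 紅綾 in chapter 1: 72 of 1531 words; idf 0.916 implies 4 of 10 chapters
hongling <- tf_idf(72, 1531, 4)
round(hongling, 4)   # 0.0431, as in the first row of the table

# 賽觀音 in chapter 6: 109 of 1669 words, but present in all 10 chapters
saiguanyin <- tf_idf(109, 1669, 10)
saiguanyin           # 0
```

With only ten documents, idf can take just a few distinct values, which is why 2.30 (one chapter) and 1.20 (three chapters) recur in the output.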

5.2 Bar charts of tf-idf

The words with the highest tf-idf in each chapter.

steal_words_tf_idf %>%
  ggplot(aes(word, tf_idf, fill = chapter)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~chapter, scales = "free") +
  coord_flip()

6. Word Correlation

6.1 Count how often two words occur together (number of shared chapters)

word_pairs <- steal_words %>%
  pairwise_count(word, chapter, sort = TRUE)
word_pairs

## # A tibble: 6,734,594 x 3
##    item1  item2      n
##    <chr>  <chr>  <dbl>
##  1 於是   賽觀音    10
##  2 一口氣 賽觀音    10
##  3 神情   賽觀音    10
##  4 回答   賽觀音    10
##  5 想到   賽觀音    10
##  6 問題   賽觀音    10
##  7 一眼   賽觀音    10
##  8 人物   賽觀音    10
##  9 告訴   賽觀音    10
## 10 許多   賽觀音    10
## # ... with 6,734,584 more rows
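
Because the feature passed to pairwise_count() is chapter, each n is the number of chapters in which both words occur. It can never exceed 10, and every word that appears in all chapters ties at the top with 賽觀音. The counting itself can be sketched as a matrix product over a word-by-chapter presence matrix, assuming a hypothetical three-word, four-chapter example:

```r
# Hypothetical presence matrix: 1 if the word occurs in the chapter
words <- c("賽觀音", "土匪", "文件")
presence <- matrix(c(1, 1, 1, 1,    # 賽觀音: every chapter
                     1, 1, 0, 0,    # 土匪: chapters 1 and 2
                     0, 0, 0, 1),   # 文件: chapter 4 only
                   nrow = 3, byrow = TRUE,
                   dimnames = list(words, paste0("ch", 1:4)))

# Entry [i, j] is the number of chapters that contain both word i and word j
shared <- presence %*% t(presence)
shared["賽觀音", "土匪"]   # 2
shared["土匪", "文件"]     # 0
```

A finer unit, such as the line column created in section 2.1, would give co-occurrence counts that separate words better. That is a possible variation and not what the report computes.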

6.2 Correlation between pairs of words

word_cors <- steal_words %>%
  group_by(word) %>%
  filter(n() >= 7) %>%
  pairwise_cor(word, chapter, sort = TRUE)
word_cors

## # A tibble: 17,030 x 3
##    item1 item2 correlation
##    <chr> <chr>       <dbl>
##  1 容易  於放           1.
##  2 興趣  白素           1.
##  3 女兒  母親           1.
##  4 自然  母親           1.
##  5 變成  敘述           1.
##  6 改變  敘述           1.
##  7 環境  離開           1.
##  8 原來  離開           1.
##  9 過去  離開           1.
## 10 身上  離開           1.
## # ... with 17,020 more rows

Words highly correlated with a chosen keyword (here 組織), but are the values all NA?

word_cors %>%
  filter(item1 == "組織") %>% 
  head(100)

## # A tibble: 100 x 3
##    item1 item2 correlation
##    <chr> <chr>       <dbl>
##  1 組織  土匪        0.764
##  2 組織  敘述        0.764
##  3 組織  經歷        0.764
##  4 組織  變成        0.764
##  5 組織  改變        0.764
##  6 組織  故事        0.667
##  7 組織  離開        0.667
##  8 組織  環境        0.667
##  9 組織  原來        0.667
## 10 組織  過去        0.667
## # ... with 90 more rows
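
The many correlations of exactly 1, and the repeated values 0.764 and 0.667, are what one would expect from this setup. Without a value column, pairwise_cor() correlates 0/1 presence indicators (the phi coefficient), and the filter n() >= 7 keeps only words found in at least seven of the ten chapters, so very few distinct presence patterns remain. A minimal sketch, assuming hypothetical presence vectors (they reproduce the value 0.764 but are not taken from the data):

```r
# Phi coefficient of two 0/1 vectors equals their Pearson correlation
phi <- function(x, y) cor(x, y)

# Hypothetical presence over 10 chapters
a    <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0)   # present in chapters 1 to 8
b    <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)   # present in chapters 1 to 7
same <- a                                 # exactly the same chapters as a

phi(a, same)          # 1: identical chapter sets
round(phi(a, b), 3)   # 0.764: seven chapters nested inside eight
```

A correlation of 1 therefore means only that two words occur in the same set of chapters, not that they are used together within sentences.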

6.3 Words most correlated with “母親”, “組織”, “秘密”, “關係”, “土匪”

For each of these five keywords (mother, organization, secret, relationship, bandit), find the five most highly correlated words.

word_cors %>%
  filter(item1 %in% c("母親", "組織", "秘密", "關係", "土匪")) %>%
  group_by(item1) %>%
  top_n(5, correlation) %>%
  ungroup() %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  ggplot(aes(item2, correlation)) +
  geom_col() +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  theme(text = element_text(family = "Heiti TC Light")) # set a Chinese font so the labels render correctly

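6.4 Co-occurrence correlation graph

The network topology matches the main structure of the story, moving from negative feelings toward the mother to the secret of the past.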

set.seed(2020)

word_cors %>%
  filter(correlation > 0.4) %>%
  filter(item1 %in% c("母親", "組織","秘密","關係","土匪")) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + # set a Chinese font so the labels render correctly
  theme_void()

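7. Conclusion

The conclusion of a Ni Kuang novel can be guessed from its title, and his vocabulary is neither complicated nor unusual. The first half of the text sets up the plot, and words tied to the ending recur in the later parts. The words with high tf-idf match the plot closely.

Our reading of the plot and the results of the text analysis corroborate each other and largely agree. The co-occurrence relations among the main keywords also sketch the concrete direction of the story.

The author uses a great deal of colloquial description to bring out subtle interactions between characters, for example 輕輕地 (gently) and 從容地 (calmly). Because such words are unrelated to the content, they were removed by the stop-word list, so the words that actually describe the story amount to only about two thirds of the full text.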

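How our reading of the story corresponds with the text analysis:

1. Why does 賽觀音 survive in the end, and why is there no record in the organization's documents?

The confidential file leaves many questions open: the whole affair was reported by 賽觀音 alone, with no one else to corroborate it, and the group of fleeing cadres and female soldiers could not be found. So when the organization received her report, it chose to believe it and investigated no further. Had an investigation uncovered the truth, everyone involved in the decisions and their execution could have been denounced in a political storm, so it was better to treat everything as normal and leave no leverage in anyone else's hands. History, then, is not decided entirely by those in power; the officials under them have a share in it too.

2. Her daughter produces the so-called documents to try to prove that everything her mother says is delusion. What is the emotional relationship between mother and daughter?

At the start of the story the daughter is portrayed quite positively, but in the middle and later chapters the relationship between mother and daughter is described as negative and tense. The daughter strongly resents her mother's bandit origins yet worships her father, whose background is politically impeccable (根正苗紅). She never formally broke with her mother, but their different ways of thinking led to many quarrels in daily life. A beautiful daughter born in turbulent times was bound to be persecuted. Having finally regained a certain status, she naturally depends on the present organization and fears being persecuted again, so she refuses to believe what her mother says.

3. What is the great secret that 賽觀音 really wants to tell, and how does it relate to the author's metaphor?

It is a metaphor for a certain regime that, during its rise and the early years of the state, attracted the support of many idealistic young people with its ideal of "serving the people". Why, then, did the earth-shaking decade (the Cultural Revolution) happen later? Perhaps because the leader was no longer the original leader, and the elite second generation was no longer the original second generation but was instead connected with the bandits. Of course, these 203 people include no shortage of well-known political figures who hold office today.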

Group 18: 劉宜銘、卓怡姍、劉純妤、徐梓竣

2021/5/4
