根據臺灣內政部人口統計顯示,2020年第4季台灣首度出現了人口「生死交叉」,於今年2021年第1季,更是相較於2020第4季減少13.6%的人口,臺灣目前正持續呈現人口負成長狀態,這是一個非常嚴重的國安問題。 因此我們就在想,到底造成現在年輕人不想生孩子的主要原因到底是什麼,因此我們先假設幾個前提:
- 跟低薪資有關
- 跟高房價有關
- 跟教、養、育有關
資料的時間區間:2020/4/1-2021/4/1 資料的關鍵字:生小孩或生孩子 網站來源:PPT 抓取資料主題或範圍:八卦版文章與網友留言、女孩版網友留言與婚姻版網友留言。
避免中文亂碼(Windows系統可將這行註解)之處理
## [1] ""
安裝需要的packages與載入
#安裝
packages = c("readr", "dplyr", "stringr", "jiebaR", "tidytext", "NLP", "readr", "tidyr", "ggplot2", "ggraph", "igraph", "scales", "reshape2", "widyr","data.table","wordcloud2")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
#載入
library(readr)
library(dplyr)
library(stringr)
library(jiebaR)
library(tidytext)
library(NLP)
library(tidyr)
library(ggplot2)
library(ggraph)
library(igraph)
library(scales)
library(reshape2)
library(widyr)
library(data.table)
library(wordcloud2)讀入資料並初步合併處理
#rm(list=ls())
#資料說明
##1.女孩版資料(ptt_Catch_GirlsBoard_articleReviews),共3,685筆資料
csv <- fread("./data/ptt_Catch_GirlsBoard_articleReviews.csv", encoding = "UTF-8")
count(csv)##2.婚姻版資料(ptt_Catch_MarryBoard__articleReviews),共1,553筆資料
csv1 <- fread("./data/ptt_Catch_MarryBoard__articleReviews.csv", encoding = "UTF-8")
count(csv1)# ##3.八卦版討論主題(ptt_Gossiping_articleMetaData),共397筆資料
# csv2 <- fread("./data/ptt_Gossiping_articleMetaData.csv", encoding = "UTF-8")
# count(csv2)
##4.八卦版討論之留言(ptt_Gossiping_articleReviews),共12,548筆資料
csv3 <- fread("./data/ptt_Gossiping_articleReviews.csv", encoding = "UTF-8")
count(csv3)根據不同的條件,再篩選一次網友的文章: 以錢、薪水來篩選 以房、家來篩選 以教、養來篩選
keywords = c('錢','薪水')
toMatch = paste(keywords,collapse="|")
csvT01 = with(csvT, csvT[grepl(toMatch,cmtContent)|grepl(toMatch,artTitle),])
keywords = c('房','家')
toMatch = paste(keywords,collapse="|")
csvT02 = with(csvT, csvT[grepl(toMatch,cmtContent)|grepl(toMatch,artTitle),])
keywords = c('教','養')
toMatch = paste(keywords,collapse="|")
csvT03 = with(csvT, csvT[grepl(toMatch,cmtContent)|grepl(toMatch,artTitle),])
# 以錢、薪水來篩選
csvT01# 加入停用的字典
jieba_tokenizer <- worker(user="./dict/user_dict.txt", stop_word = "./dict/stop_words.txt")
# 設定斷詞function
customized_tokenizer <- function(t) {
lapply(t, function(x) {
tokens <- segment(x, jieba_tokenizer)
tokens <- tokens[!tokens %in% stop_words]
# 去掉字串長度爲1的詞彙
tokens <- tokens[nchar(tokens)>1]
return(tokens)
})
}根據不同的主題繪製文字雲來初步分析 以錢、薪水來篩選,可以得知網友大多認為生小孩是需要一定花費
library(wordcloud2)
tokens_count_csvT01T <- csvT01T %>%
count(word, sort = TRUE) %>%
filter(word!=c('小孩','生小孩','生孩子')) %>%
filter(n>10) #只取出現10次以上
tokens_count_csvT01T %>% wordcloud2(size=0.8)以房、家來篩選,可以得知網友大多認為生小孩是需要買房而且要三房,大部分也認為要背房貸,甚至有部分網友也有論到生育率問題。
library(wordcloud2)
tokens_count_csvT02T <- csvT02T%>%
count(word, sort = TRUE) %>%
filter(n>30) #只取出現30次以上
tokens_count_csvT02T %>% wordcloud2(size = 0.8)Caption for the picture.
以教、養來篩選,可以得知網友大多認為生小孩是養兒防老,也認為教養需要一定花費(包含房貸),甚至有網友討論領養、棄養問題。
tokens_count_csvT03T <- csvT03T%>%
count(word, sort = TRUE) %>%
filter(word!=c('小孩','生小孩','生孩子')) %>%
filter(n>10) #只取出現10次以上
tokens_count_csvT03T %>% wordcloud2(size=0.5)Caption for the picture.
全名Linguistic Inquiry and Word Counts,由心理學家Pennebaker於2001出版 分為正向情緒與負向情緒
#讀檔,字詞間以","將字分隔
P <- read_file("./liwc/positive.txt") # 正向字典txt檔
N <- read_file("./liwc/negative.txt") # 負向字典txt檔
#字典txt檔讀進來是一整個字串
typeof(P)## [1] "character"
#分割字詞,並將兩個情緒字典併在一起
# 將字串依,分割
# strsplit回傳list , 我們取出list中的第一個元素
P = strsplit(P, ",")[[1]]
N = strsplit(N, ",")[[1]]
# 建立dataframe 有兩個欄位word,sentiments,word欄位內容是字典向量
P = data.frame(word = P, sentiment = "positive") #664
N = data.frame(word = N, sentiment = "negative") #1047
# 把兩個字典拼在一起
LIWC = rbind(P, N)
# 檢視字典
head(LIWC)初步繪製每日網友討論不同主題之留言數直方圖 可以發現多數網友認為,生小孩或養小孩之先後考量為「錢>房>養」
csvT01C <-mutate(csvT01,category='1') ##1.以錢、薪水來篩選
csvT02C <-mutate(csvT02,category='2') ##2.以房、家來篩選
csvT03C <-mutate(csvT03,category='3') ##3.以教、養來篩選
csvTC = rbind(csvT01C, csvT02C, csvT03C) ##三種分類合併來分析
csvTC$artDate= csvTC$artDate %>% as.Date("%Y/%m/%d")
csvTC %>%
group_by(artDate) %>%
summarise(count = n(),category) %>%
ggplot( aes(x=artDate, y=count, fill=category)) +
geom_bar(stat="identity")## `summarise()` has grouped output by 'artDate'. You can override using the `.groups` argument.
分析前之資料前處理與算出每天情緒總和(sentiment_count)
csvT01TC <-mutate(csvT01T,category='1') ##1.以錢、薪水來篩選
csvT02TC <-mutate(csvT02T,category='2') ##2.以房、家來篩選
csvT03TC <-mutate(csvT03T,category='3') ##3.以教、養來篩選
csvTTC = rbind(csvT01TC, csvT02TC, csvT03TC) ##三種分類合併來分析
# 日期格式化
csvTTC$artDate= csvTTC$artDate %>% as.Date("%Y/%m/%d")
# 過濾特殊字元
data_select = csvTTC %>%
filter(!grepl('[[:punct:]]',word)) %>% # 去標點符號
filter(!grepl("['^0-9a-z']",word)) %>% # 去英文、數字
filter(nchar(.$word)>1)
# 計算每隔類別留言的情緒分數
#sentiment_count:artDate,sentiment,category,count
sentiment_count = data_select %>%
select(artDate,word,category) %>%
inner_join(LIWC) %>%
group_by(artDate,sentiment,category) %>%
summarise(count=n()) %>%
spread(sentiment,count,fill = 0) %>%
mutate(sentiment = positive - negative)## Joining, by = "word"
## `summarise()` has grouped output by 'artDate', 'sentiment'. You can override using the `.groups` argument.
根據不同的留言種類,進行正負情緒分數之折線圖分析:
1.以錢、薪水來篩選之分析,情緒多為正面為主。
2.以房、家來篩選之分析,情緒多為負面為主。
3.以教、養來篩選之分析,情緒多為差距不大。
sentiment_count %>%
ggplot()+
geom_line(aes(x=artDate,y=sentiment,colour=category))+
scale_x_date(labels = date_format("%m/%d"))根據不同的留言種類,找出正負情緒之關鍵字: 經分析發現各主題共通的正向關鍵字都有錢,而負向關鍵字則討論問題居多 以錢、薪水來篩選正負情緒關鍵字。
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select[category=='1'] %>%
select(word,category) %>%
inner_join(LIWC) %>%
group_by(word,sentiment,category) %>%
summarise(count=n()) %>% # 算字詞單篇總數用summarise
filter(count>3) %>% # 過濾出現太少次的字
arrange(desc(count))%>%
head(20)## Joining, by = "word"
## `summarise()` has grouped output by 'word', 'sentiment'. You can override using the `.groups` argument.
word_count %>%
ggplot(aes(x=count, y=reorder(word,count),fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))以房、家來篩選正負情緒關鍵字。
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select[category=='2'] %>%
select(word,category) %>%
inner_join(LIWC) %>%
group_by(word,sentiment,category) %>%
summarise(count=n()) %>% # 算字詞單篇總數用summarise
filter(count>3) %>% # 過濾出現太少次的字
arrange(desc(count))%>%
head(20)## Joining, by = "word"
## `summarise()` has grouped output by 'word', 'sentiment'. You can override using the `.groups` argument.
word_count %>%
ggplot(aes(x=count, y=reorder(word,count),fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))以教、養來篩選正負情緒關鍵字。
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select[category=='3'] %>%
select(word,category) %>%
inner_join(LIWC) %>%
group_by(word,sentiment,category) %>%
summarise(count=n()) %>% # 算字詞單篇總數用summarise
filter(count>3) %>% # 過濾出現太少次的字
arrange(desc(count))%>%
head(20)## Joining, by = "word"
## `summarise()` has grouped output by 'word', 'sentiment'. You can override using the `.groups` argument.
word_count %>%
ggplot(aes(x=count, y=reorder(word,count),fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
theme(text=element_text(size=14))tf-idf分析的前置處理
# 算每天不同字的詞頻
# word_count:artDate,word,count
word_count <- data_select %>%
select(word,category,artUrl) %>%
group_by(word,category,artUrl) %>%
summarise(count=n()) %>% # 算字詞單篇總數用summarise
arrange(desc(count))## `summarise()` has grouped output by 'word', 'category'. You can override using the `.groups` argument.
#計算每篇文章留言的詞數
total_words <- word_count %>%
group_by(artUrl,category) %>%
summarise(total=(sum(count)))## `summarise()` has grouped output by 'artUrl'. You can override using the `.groups` argument.
## Joining, by = c("category", "artUrl")
以每篇文章爲單位,計算每個詞彙的 tf-idf 值,並分析前100名的詞彙 1.以錢、薪水來篩選之分析,常見的字詞分別為:
##1.以錢、薪水來篩選之分析。
csvTTC_tf_idf1 <- csvTTC_join%>%
bind_tf_idf(word, artUrl, count) %>%
filter(category=='1') %>%
group_by(artUrl) %>%
arrange(desc(tf_idf)) %>%
select(word,tf_idf) %>%
head(100)
csvTTC_tf_idf12.以房、家來篩選之分析。
##2.以房、家來篩選之分析。
csvTTC_tf_idf2 <- csvTTC_join%>%
bind_tf_idf(word, artUrl, count) %>%
filter(category=='2') %>%
group_by(artUrl) %>%
arrange(desc(tf_idf)) %>%
select(word,tf_idf) %>%
head(100)
csvTTC_tf_idf23.以教、養來篩選之分析。
##3.以教、養來篩選之分析。
csvTTC_tf_idf3 <- csvTTC_join%>%
bind_tf_idf(word, artUrl, count) %>%
filter(category=='3') %>%
group_by(artUrl) %>%
arrange(desc(tf_idf)) %>%
select(word,tf_idf) %>%
head(100)
csvTTC_tf_idf3留言初始處理
#計算單一詞彙及出現次數
csvTTC_words <- csvTTC %>%
filter(!str_detect(word, regex("[0-9a-zA-Z]"))) %>%
count(artUrl, word, category, sort = TRUE)
#計算兩個詞彙同時出現的總次數與相關性###總次數
csvTTC_words_pairs1 <- csvTTC_words[category=='1'] %>%
pairwise_count(word, artUrl, sort = TRUE)
csvTTC_words_pairs1###相關性
csvTTC_words_cor1 <- csvTTC_words[category=='1'] %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, artUrl, sort = TRUE)
csvTTC_words_cor1###總次數
csvTTC_words_pairs2 <- csvTTC_words[category=='2'] %>%
pairwise_count(word, artUrl, sort = TRUE)
csvTTC_words_pairs2###相關性
csvTTC_words_cor2 <- csvTTC_words[category=='2'] %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, artUrl, sort = TRUE)
csvTTC_words_cor2###總次數
csvTTC_words_pairs3 <- csvTTC_words[category=='3'] %>%
pairwise_count(word, artUrl, sort = TRUE)
csvTTC_words_pairs3###相關性
csvTTC_words_cor3 <- csvTTC_words[category=='3'] %>%
group_by(word) %>%
filter(n() >= 10) %>%
pairwise_cor(word, artUrl, sort = TRUE)
csvTTC_words_cor3使用詞彙關係圖畫出相關性的組合: 以錢、薪水來篩選之分析。
##1.以錢、薪水來篩選之分析。
set.seed(100)
csvTTC_words_cor1 %>%
filter(correlation > 0.3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + #加入中文字型設定,避免中文字顯示錯誤。
theme_void()## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
以房、家來篩選之分析。
##2.以房、家來篩選之分析。
set.seed(100)
csvTTC_words_cor2 %>%
filter(correlation > 0.48) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + #加入中文字型設定,避免中文字顯示錯誤。
theme_void()## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
以教、養來篩選之分析。
##3.以教、養來篩選之分析。
set.seed(100)
csvTTC_words_cor3 %>%
filter(correlation > 0.285) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 3) +
geom_node_text(aes(label = name), repel = TRUE, family = "Heiti TC Light") + #加入中文字型設定,避免中文字顯示錯誤。
theme_void()## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
- 以上述統計數據來看,可以看得出現在年輕人主要不生小孩與生小孩需要花很多錢有高度的相關性,雖然正向的情緒分析大於負面情緒,但我們覺得應該是誤判,像是有錢被判斷為正向情緒,但其實在討論版上的意思是生小孩需要有錢。
- 而高房價也有著間接影響生育關係,因從情緒分析來看,負面情緒較正面情緒多,從討論看得出,現在年輕人隨著房價越來越高,也會降低生小孩的意願。
- 最後養育情緒分析較平均,因此看不出是否與生小孩有直接關係。