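A. Motivation and Purpose of the Analysis

  • Purpose: Use coreNLP and sentimentr to analyze Twitter text data about the Tokyo Olympics in Japan.
  • Overview: The Tokyo Summer Olympics, originally scheduled for 2020, were postponed to 2021 because of the global COVID-19 pandemic. We use tweets to analyze public sentiment toward the "Tokyo Olympics".

B. Dataset Description

  • Source: Twitter, 4/09 to 4/16, 5,000 tweets, English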

C. Data Analysis Process

C-1 coreNLP: natural language processing

Install packages

packages = c("dplyr","ggplot2","scales","rtweet","xml2","httr","jsonlite","magrittr","data.tree","NLP","igraph","sentimentr","tidytext","stringr","wordcloud2","DiagrammeR")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
library(wordcloud2)
library(ggplot2)
library(scales)
library(rtweet)
library(dplyr)
library(xml2)
library(httr)
library(jsonlite)
library(magrittr)
library(data.tree)
library(tidytext)
library(stringr)
library(DiagrammeR)
load("tokyo2020.RData")

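Data collection: tweets

(1). Twitter API setup: fetch tweets with rtweet

# Credentials are read from environment variables rather than hard-coded,
# so that keys and tokens are not published with the report.
# The environment variable names below are placeholders; use the names you set.
app = 'heminghsin'
consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token = Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret = Sys.getenv("TWITTER_ACCESS_SECRET")
twitter_token <- create_token(app,consumer_key, consumer_secret,
                    access_token, access_secret,set_renv = FALSE)
# Consumer keys: identify who you are
# Authentication tokens: the authorization granted to you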

(2). Fetch tweets by keyword

# Search keywords
key = c("tokyo")
context = "Olympics"
q = paste(c(key,context),collapse=" AND ")
# Resulting query: "tokyo AND Olympics"
# Searching for tokyo alone would also return tweets unrelated to the Olympics, so Olympics is required to co-occur

# Fetch 5,000 tweets, excluding retweets
tweets = search_tweets(q,lang="en",n=5000,include_rts = FALSE,token = twitter_token)

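(3). Cleaning the tweet text

## Text-cleaning helper
clean = function(txt) {
  txt = iconv(txt, "latin1", "ASCII", sub="") # convert the encoding; characters that cannot be represented in ASCII are dropped
  txt = gsub("(@|#)\\w+", "", txt) # remove @mentions and #hashtags (@ or # followed by letters, digits, underscores)
  txt = gsub("(http|https)://.*", "", txt) # remove URLs; ".*" is greedy, so the text after the first URL is removed too
  txt = gsub("[ \t]{2,}", "", txt) # delete runs of two or more spaces/tabs; this can join adjacent words (e.g. "thecan" in later output)
  txt = gsub("\\n"," ",txt) # replace newlines with a space
  txt = gsub("\\s+"," ",txt) # collapse runs of whitespace into a single space
  txt = gsub("^\\s+|\\s+$","",txt) # trim leading and trailing whitespace
  txt = gsub("&.*;","",txt) # remove HTML character entities (greedy match)
  txt = gsub("[^a-zA-Z0-9?!. ']","",txt) # keep only letters, digits, spaces and ?!.' (emoji are removed)
  txt }


tweets$text = clean(tweets$text)  # apply the cleaning to text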
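
The cleaning steps above delete removed tokens outright, which is why fragments such as "thecan" show up later in the results. As a minimal sketch only (not the function used to produce the results in this report), the variant below replaces each removed piece with a space and collapses whitespace at the end. The name clean_keep_spaces and the sample tweet are hypothetical.

```r
# Hypothetical variant of clean(): substitute a space for every removed piece,
# then collapse whitespace once at the end so neighbouring words stay separate.
clean_keep_spaces <- function(txt) {
  txt <- iconv(txt, "latin1", "ASCII", sub = "")   # drop non-ASCII characters
  txt <- gsub("(http|https)://\\S+", " ", txt)     # remove each URL only, not the rest of the text
  txt <- gsub("(@|#)\\w+", " ", txt)               # remove @mentions and #hashtags
  txt <- gsub("&\\w+;", " ", txt)                  # remove one HTML entity at a time
  txt <- gsub("[^a-zA-Z0-9?!. ']", " ", txt)       # keep letters, digits, spaces and ?!.'
  txt <- gsub("\\s+", " ", txt)                    # collapse whitespace runs
  trimws(txt)                                      # trim both ends
}

# Made-up example tweet
sample_tweet <- "Let's hope the #Olympics can still go on! https://t.co/abc &amp; more"
cleaned <- clean_keep_spaces(sample_tweet)
print(cleaned)
```

Switching to a variant like this would change the token counts and sentiment values reported below, so it is best treated as a separate experiment.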

df = data.frame()
  
df = rbind(df,tweets)  # combine into a data frame

df = df[!duplicated(df[,"status_id"]),]  # drop duplicate tweets
head(df)
## # A tibble: 6 x 90
##   user_id   status_id   created_at          screen_name  text            source 
##   <chr>     <chr>       <dttm>              <chr>        <chr>           <chr>  
## 1 74479495… 1383304039… 2021-04-17 06:19:18 hvmojidra    Historic! Army… Twitte…
## 2 42584864  1383303679… 2021-04-17 06:17:52 twittyoota   Tokyo Olympics… Twitte…
## 3 427588761 1383303625… 2021-04-17 06:17:39 PeterConsta… 2004 OLYMPICS … Twitte…
## 4 427588761 1382643201… 2021-04-15 10:33:22 PeterConsta… OLYMPICS 2004 … Twitte…
## 5 127500821 1383303212… 2021-04-17 06:16:01 novelletten  Tokyo Olympics… Twitte…
## 6 7309052   1383303193… 2021-04-17 06:15:56 YahooNews    There are a va… Social…
## # … with 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

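df has 90 columns, but only a few are used here:

  • user_id: user ID
  • status_id: tweet ID
  • created_at: time the tweet was posted
  • text: tweet text
  • source: client the tweet was posted from

Number of records and time distribution

created_at is already a date-time (POSIXct) column, so min and max directly give the earliest and latest timestamps.
Note: the standard Twitter search API used by rtweet only reaches back roughly a week (about 6 to 9 days).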

nrow(df)
## [1] 5000
min(df$created_at)
## [1] "2021-04-14 12:40:30 UTC"
max(df$created_at)
## [1] "2021-04-17 06:19:18 UTC"
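
Beyond the minimum and maximum, a per-day count shows how the tweets are spread over the collection window. The following is a minimal base-R sketch on made-up timestamps; with the real data the same call would be applied to df$created_at.

```r
# Made-up timestamps standing in for df$created_at
created_at <- as.POSIXct(
  c("2021-04-14 12:40:30", "2021-04-15 08:00:00",
    "2021-04-15 23:59:59", "2021-04-17 06:19:18"),
  tz = "UTC"
)

# Convert to calendar dates (in UTC) and count tweets per day
per_day <- table(as.Date(created_at, tz = "UTC"))
print(per_day)
```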

Connecting to the CoreNLP API

(1). API call setup

Server side:
  • Start the CoreNLP server in a terminal first.
  • Open a terminal in the CoreNLP directory and run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

# Build the CoreNLP API URL: turn the local address into a URL the CoreNLP server accepts
generate_API_url <- function(host, port="9000",
                    tokenize.whitespace="false", annotators=""){ # "false": do not tokenize on whitespace only
    url <- sprintf('http://%s:%s/?properties={"tokenize.whitespace":"%s","annotators":"%s"}', host, port, tokenize.whitespace, annotators)
    url <- URLencode(url)
    return(url) # return explicitly; a trailing assignment returns its value invisibly
}
# Address of the service
host = "127.0.0.1"

generate_API_url(host)
# Call the CoreNLP API
call_coreNLP <- function(server_host, text, host="localhost", language="eng",
                    tokenize.whitespace="true", ssplit.eolonly="true", annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")){
  # Note: host and ssplit.eolonly are accepted but not used in the request below
  # Assume two CoreNLP servers: one for English (port 9000) and one for Chinese (port 9001)
  port <- ifelse(language=="eng", 9000, 9001);
  # Build the API URL
  url <- generate_API_url(server_host, port=port,
                    tokenize.whitespace=tokenize.whitespace, annotators=paste0(annotators, collapse = ','))
  
  result <- POST(url, body = text, encode = "json")
  doc <- httr::content(result, "parsed","application/json",encoding = "UTF-8")
  return (doc)
}
# Run the CoreNLP service over the documents
coreNLP <- function(data,host){
  # Send each document to CoreNLP in turn; each response comes back as JSON
  # The parsed results are stored as R objects
  result <- apply(data, 1 , function(x){
    object <- call_coreNLP(host, x['text'])
    list(doc=object, data=x)
  })
  
  return(result)
}
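
The request URL carries the annotator settings as a JSON string in the properties query parameter. The sketch below shows one way to build such a URL by encoding only the JSON value, then decodes it again to check the round trip. It makes no network call, and build_corenlp_url is a hypothetical helper, not the function used above.

```r
# Hypothetical helper: build the request URL, percent-encoding only the JSON properties value
build_corenlp_url <- function(host, port = 9000,
                              annotators = c("tokenize", "ssplit", "pos")) {
  props <- sprintf('{"tokenize.whitespace":"true","annotators":"%s"}',
                   paste(annotators, collapse = ","))
  paste0("http://", host, ":", port, "/?properties=",
         URLencode(props, reserved = TRUE))
}

url <- build_corenlp_url("127.0.0.1")
print(url)

# Decode the query value again to confirm nothing was lost
decoded <- URLdecode(sub(".*properties=", "", url))
print(decoded)
```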

(2). Data-wrangling functions

Extract the tokenization results from the returned objects, in tidy-data format

coreNLP_tokens_parser <- function(coreNLP_objects){
  
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # only the first sentence of each document is used
    sentences <- doc$sentences
   
    sen <- sentences[[1]]
    
    tokens <- do.call(rbind, lapply(sen$tokens, function(x){
      result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
      result
    }))
    
    tokens <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(tokens))) %>% 
      bind_cols(tokens)
    
    tokens
  }))
  return(result)
}
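
The parser above reads only sentences[[1]] of each document. If a response contains several sentences, the remaining ones can be flattened the same way. The sketch below illustrates this on a hand-built mock object that mimics the nested list shape used above (sentences, tokens, and the word/lemma/pos/ner fields). The mock data and the name tokens_all_sentences are assumptions for illustration.

```r
# Mock of one parsed CoreNLP response with two sentences (made-up content)
mock_doc <- list(sentences = list(
  list(tokens = list(
    list(word = "Tokyo", lemma = "Tokyo", pos = "NNP", ner = "CITY"),
    list(word = "hosts", lemma = "host",  pos = "VBZ", ner = "O")
  )),
  list(tokens = list(
    list(word = "Games", lemma = "game",  pos = "NNS", ner = "O")
  ))
))

# Flatten the tokens of every sentence, keeping the sentence index
tokens_all_sentences <- function(doc) {
  do.call(rbind, lapply(seq_along(doc$sentences), function(i) {
    do.call(rbind, lapply(doc$sentences[[i]]$tokens, function(tok) {
      data.frame(sentence = i, word = tok$word, lemma = tok$lemma,
                 pos = tok$pos, ner = tok$ner, stringsAsFactors = FALSE)
    }))
  }))
}

toks <- tokens_all_sentences(mock_doc)
print(toks)
```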

Extract the dependency relations from the returned CoreNLP objects, in tidy-data format

coreNLP_dependency_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # only the first sentence of each document is used
    sentences <- doc$sentences
    sen <- sentences[[1]]
    dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
      result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
      result
    }))
  
    dependencies <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(dependencies))) %>% 
      bind_cols(dependencies)
    dependencies
  }))
  return(result)
}

Extract the sentence sentiment from the returned CoreNLP objects, in tidy-data format

coreNLP_sentiment_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # only the first sentence of each document is used
    sentences <- doc$sentences
    sen <- sentences[[1]]
    
    sentiment <- original_data %>%
      t() %>% 
      data.frame() %>% 
      bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
  
    sentiment
  }))
  return(result)
}

Visualizing the parse tree

Code adapted from: https://stackoverflow.com/questions/35496560/how-to-convert-corenlp-generated-parse-tree-into-data-tree-r-package

# Plot the parse tree returned by CoreNLP (the bracketed string in the parse field)
parse2tree <- function(ptext) {
  stopifnot(require(NLP) && require(igraph))
  
  # this step modifies coreNLP parse tree to mimic openNLP parse tree
  ptext <- gsub("[\r\n]", "", ptext)
  ptext <- gsub("ROOT", "TOP", ptext)


  ## Replace words with unique versions
  ms <- gregexpr("[^() ]+", ptext)                                      # just ignoring spaces and brackets?
  words <- regmatches(ptext, ms)[[1]]                                   # just words
  regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # add id to words
  
  ## Going to construct an edgelist and pass that to igraph
  ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
  edgelist <- matrix('', nrow=length(words)-2, ncol=2)
  
  ## Function to fill in edgelist in place
  edgemaker <- (function() {
    i <- 0                                       # row counter
    g <- function(node) {                        # the recursive function
      if (inherits(node, "Tree")) {            # only recurse subtrees
        if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
          for (child in node$children) {
            childval <- if(inherits(child, "Tree")) child$value else child
            i <<- i+1
            edgelist[i,1:2] <<- c(val, childval)
          }
        }
        invisible(lapply(node$children, g))
      }
    }
  })()
  
  ## Create the edgelist from the parse tree
  edgemaker(Tree_parse(ptext))
  tree <- FromDataFrameNetwork(as.data.frame(edgelist))
  return (tree)
}

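Sending the sentences to the service

Get the objects returned by coreNLP
Running this block takes about 10 minutes

gc() # free unused memory

t0 = Sys.time()
obj = df[,c(2,5)]  %>% filter(text != "") %>% coreNLP(host) # run against the local server
# The object passed to coreNLP must be a data.frame with a text column

Sys.time() - t0 # elapsed time
#Time difference of 10 mins

save.image("tokyo2020.RData")
# Save what will be needed later so the RData file can be loaded directly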
#tokens =  coreNLP_tokens_parser(obj)
#dependencies = coreNLP_dependency_parser(obj)
#sentiment = coreNLP_sentiment_parser(obj)
#save.image("coreNLP_all.RData")

Extracting the results

(1). Tokenization, lemmatization, POS tagging, NER

tokens =  coreNLP_tokens_parser(obj)
head(tokens,20)
##              status_id      word     lemma pos      ner lower_word lower_lemma
## 1  1383119211250262019       Wow       wow  UH        O        wow         wow
## 2  1383119211250262019      that      that  DT        O       that        that
## 3  1383119211250262019   newsong   newsong  NN        O    newsong     newsong
## 4  1383119211250262019      hits       hit VBZ        O       hits         hit
## 5  1383119211250262019     deep.     deep.  RB        O      deep.       deep.
## 6  1383119211250262019        We        we PRP        O         we          we
## 7  1383119211250262019      need      need VBP        O       need        need
## 8  1383119211250262019      more      more JJR        O       more        more
## 9  1383119211250262019    voices     voice NNS        O     voices       voice
## 10 1383119211250262019        in        in  IN        O         in          in
## 11 1383119211250262019       the       the  DT        O        the         the
## 12 1383119211250262019    social    social  JJ IDEOLOGY     social      social
## 13 1383119211250262019   justice   justice  NN IDEOLOGY    justice     justice
## 14 1383119211250262019 platform. platform.  NN        O  platform.   platform.
## 15 1383119211250262019     Thank     thank VBP        O      thank       thank
## 16 1383119211250262019       for       for  IN        O        for         for
## 17 1383119211250262019       you       you PRP        O        you         you
## 18 1383119211250262019     that.     that. VBP        O      that.       that.
## 19 1383119211250262019      This      this  DT        O       this        this
## 20 1383119211250262019      song      song  NN        O       song        song
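  • Columns returned by coreNLP_tokens_parser:
    • status_id: matches status_id in df, the unique ID of a tweet
    • word: the original token
    • lemma: the lemmatized token
    • pos: part-of-speech tag
    • ner: named-entity label

(2). Named entity recognition (NER)

  • Inspect the NER labels to see which entity types were recognized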
unique(tokens$ner)
##  [1] "O"                 "IDEOLOGY"          "MISC"             
##  [4] "LOCATION"          "CITY"              "DATE"             
##  [7] "DURATION"          "NATIONALITY"       "PERSON"           
## [10] "COUNTRY"           "STATE_OR_PROVINCE" "NUMBER"           
## [13] "ORGANIZATION"      "CRIMINAL_CHARGE"   "CAUSE_OF_DEATH"   
## [16] "SET"               "ORDINAL"           "TITLE"            
## [19] "MONEY"             "TIME"              "URL"              
## [22] "RELIGION"          "PERCENT"
# Excluding the label "O" (no entity), how many distinct words were tagged with an entity
length(unique(tokens$word[tokens$ner != "O"])) 
## [1] 908

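(3). Lowercasing

Letter case affects CoreNLP's NER decisions, so the tweet text was passed in without case normalization. After the annotators have run, the new columns lower_word and lower_lemma store the lowercased word and lemma, so that case variants of the same term (such as Tokyo and tokyo) are counted together when computing word frequencies.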

tokens$lower_word = tolower(tokens$word)
tokens$lower_lemma = tolower(tokens$lemma)
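
As a small illustration of why lowercasing matters for frequency counts, the sketch below uses a made-up token column in which the same terms appear in different cases. With the real data the same idea applies to tokens$lower_word.

```r
# Made-up tokens with inconsistent casing
toy <- data.frame(word = c("Tokyo", "tokyo", "Olympics", "TOKYO", "olympics"),
                  stringsAsFactors = FALSE)

# Without lowercasing, every case variant is counted as a separate word
n_distinct_raw <- length(unique(toy$word))

# After lowercasing, the variants are merged before counting
toy$lower_word <- tolower(toy$word)
freq <- sort(table(toy$lower_word), decreasing = TRUE)

print(n_distinct_raw)
print(freq)
```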

C-2 sentimentr: English sentiment analysis

sentimentr

library(sentimentr)

mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)

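mytext <- get_sentences(mytext) # split each element into sentences; the result is a list of character vectors
Sentiment score of each text

Scores are signed: below 0 is negative, above 0 is positive, and 0 is neutral. Most scores fall roughly between -1 and 1, but this is not a hard bound (one sentence below scores -1.87).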

sentiment_by(mytext) #document level
##    element_id word_count       sd ave_sentiment
## 1:          1         10 1.497465    -0.8088680
## 2:          2          5       NA     0.5813777
## 3:          3          9 0.284605     0.2196345
Sentiment score of each sentence
sentiment(mytext) #sentence level
##    element_id sentence_id word_count  sentiment
## 1:          1           1          4  0.2500000
## 2:          1           2          6 -1.8677359
## 3:          2           1          5  0.5813777
## 4:          3           1          5  0.4024922
## 5:          3           2          4  0.0000000
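  • Returns a table with 4 columns:
    • element_id: index of the text
    • sentence_id: index of the sentence within that text
    • word_count: number of words in the sentence
    • sentiment: sentiment score of the sentence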
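
The document-level scores shown earlier are not always the plain mean of the sentence scores: text 3 has sentence scores 0.4025 and 0, yet ave_sentiment is 0.2196. This is consistent with an average that down-weights zero-scored sentences, which we assume here to be the default averaging used by sentiment_by. The sketch below re-implements that formula in base R and checks it against the numbers printed above. The function name is ours, not part of the package.

```r
# Assumed formula: the sum of the scores divided by
# (number of non-zero scores + sqrt(log(1 + number of zero scores)))
down_weighted_zero_mean <- function(x) {
  sum(x) / (sum(x != 0) + sqrt(log(1 + sum(x == 0))))
}

# Sentence-level scores copied from the sentiment(mytext) output above
s1 <- c(0.2500000, -1.8677359)   # text 1
s3 <- c(0.4024922,  0.0000000)   # text 3

doc1 <- down_weighted_zero_mean(s1)   # no zeros, so this equals the plain mean
doc3 <- down_weighted_zero_mean(s3)   # the neutral sentence counts for less than a full sentence

print(c(doc1, doc3))
```

Both values agree with the ave_sentiment column printed by sentiment_by(mytext) to the displayed precision.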

Applying sentimentr to the Twitter data

Positive words in the tweets
set.seed(10)
mytext <- get_sentences(tweets$text) # split text into sentences (list of character vectors); not used below
x <- sample(tweets$text, 1000, replace = FALSE) # random sample of 1,000 tweets, without replacement
sentiment_words <- extract_sentiment_terms(x) # extract the sentiment-bearing words
sentiment_counts <- attributes(sentiment_words)$counts # occurrence counts
sentiment_counts[polarity > 0,]   # positive words
##         words polarity  n
##   1:      top      1.0 35
##   2:   please      1.0  7
##   3:     tops      1.0  4
##   4: congrats      1.0  3
##   5:     care      1.0  3
##  ---                     
## 349:    offer      0.1  1
## 350:  veteran      0.1  1
## 351: momentum      0.1  1
## 352: building      0.1  1
## 353:    moral      0.1  1
Negative words in the tweets
sentiment_counts[polarity < 0,] %>% arrange(desc(n)) %>% top_n(10) # most frequent negative words; top_n keeps ties, so 11 rows are returned
## Selecting by n
##          words polarity  n
##  1:   pandemic    -1.00 64
##  2:     cancel    -0.75 45
##  3:  cancelled    -0.80 39
##  4:    failure    -0.75 37
##  5: cancelling    -0.80 37
##  6:  postponed    -0.40 28
##  7: infections    -1.00 27
##  8: government    -0.50 24
##  9:      virus    -0.50 23
## 10:       worn    -0.10 20
## 11:  countdown    -0.25 20
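
The table above has 11 rows even though 10 were requested, because top_n keeps every row tied with the last qualifying value (here "worn" and "countdown" both have n = 20). The base-R sketch below reproduces the two behaviours on made-up counts. In dplyr, slice_max(n, n = 10, with_ties = FALSE) is one way to get a strict cut-off.

```r
# Made-up word counts with a tie at n = 3
counts <- data.frame(words = c("a", "b", "c", "d"),
                     n = c(5, 3, 3, 1),
                     stringsAsFactors = FALSE)

# "Top 2" keeping ties: every row whose rank is 2 or better is kept
top2_with_ties <- counts[rank(-counts$n, ties.method = "min") <= 2, ]

# "Top 2" strict: sort by count and keep exactly two rows
top2_strict <- head(counts[order(-counts$n), ], 2)

print(top2_with_ties)
print(top2_strict)
```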
Highlight each sentence as positive or negative
set.seed(12)
df%>%
    filter(status_id %in% sample(unique(status_id), 30)) %>% # 30 random tweets
    mutate(review = get_sentences(text)) %$% 
    sentiment_by(review, status_id) %>%
    highlight()
## Saved in /var/folders/kc/5zttnvj52nvdl3ckb52dvjnh0000gn/T//RtmpJ0WG8f/polarity.html
## Opening /var/folders/kc/5zttnvj52nvdl3ckb52dvjnh0000gn/T//RtmpJ0WG8f/polarity.html ...

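D. Visualization of Results and Interpretation

Exploratory analysis: NER

Countries involved (COUNTRY)

With coreNLP's NER we can extract the countries (COUNTRY) mentioned in the Twitter discussion of the Tokyo Olympics, as a first look at which countries are central to the topic.

tokens %>%
  filter(ner == "COUNTRY") %>%  # keep tokens whose NER label is COUNTRY
  group_by(lower_word) %>% # group by lowercased word
  summarize(count = n()) %>% # count each group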
  top_n(n = 5, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col(fill="#8babd3")+
  ggtitle("Word Frequency (NER is COUNTRY)") +
  theme(text=element_text(size=14))

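  • The Summer Olympics are being held in "Japan".
  • "China" appears because the postponement of the Games is attributed to the COVID-19 pandemic.

Organizations involved (ORGANIZATION)

With coreNLP's NER we can extract the organizations (ORGANIZATION) mentioned in the Twitter discussion of the Tokyo Olympics, as a first look at the main companies and institutions involved.

tokens %>%
  filter(ner == "ORGANIZATION") %>%  # keep tokens whose NER label is ORGANIZATION
  group_by(lower_word) %>% # group by lowercased word
  summarize(count = n()) %>% # count each group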
  top_n(n = 10, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col(fill="#ffc080")+
  ggtitle("Word Frequency (NER is ORGANIZATION)") +
  theme(text=element_text(size=14))

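  • Liberal Democratic Party (LDP): Japan's ruling party at the time.
  • Reuters: the Reuters news agency.

People involved (PERSON)

With coreNLP's NER we can extract the people (PERSON) mentioned in the Twitter discussion of the Tokyo Olympics, as a first look at the key figures in the topic.

tokens %>%
  filter(ner == "PERSON") %>%  # keep tokens whose NER label is PERSON
  group_by(lower_word) %>% # group by lowercased word
  summarize(count = n()) %>% # count each group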
  top_n(n = 10, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col(fill="#c65911")+
  ggtitle("Word Frequency (NER is PERSON)") +
  theme(text=element_text(size=14))

  • Suga: Japanese Prime Minister Yoshihide Suga (菅義偉)
  • Nikai: Toshihiro Nikai (二階俊博), Secretary-General of the ruling LDP

Exploratory analysis: Dependency

Dependency relations
dependencies = coreNLP_dependency_parser(obj)
head(dependencies,20)
##              status_id       dep governor governorGloss dependent
## 1  1383119211250262019      ROOT        0          ROOT         7
## 2  1383119211250262019 discourse        7          need         1
## 3  1383119211250262019       det        3       newsong         2
## 4  1383119211250262019     nsubj        4          hits         3
## 5  1383119211250262019 parataxis        7          need         4
## 6  1383119211250262019    advmod        4          hits         5
## 7  1383119211250262019     nsubj        7          need         6
## 8  1383119211250262019      amod        9        voices         8
## 9  1383119211250262019     nsubj       15         Thank         9
## 10 1383119211250262019      case       14     platform.        10
## 11 1383119211250262019       det       14     platform.        11
## 12 1383119211250262019      amod       14     platform.        12
## 13 1383119211250262019  compound       14     platform.        13
## 14 1383119211250262019      nmod        9        voices        14
## 15 1383119211250262019     ccomp        7          need        15
## 16 1383119211250262019      mark       18         that.        16
## 17 1383119211250262019     nsubj       18         that.        17
## 18 1383119211250262019     advcl       15         Thank        18
## 19 1383119211250262019       det       20          song        19
## 20 1383119211250262019     nsubj       25          song        20
##    dependentGloss
## 1            need
## 2             Wow
## 3            that
## 4         newsong
## 5            hits
## 6           deep.
## 7              We
## 8            more
## 9          voices
## 10             in
## 11            the
## 12         social
## 13        justice
## 14      platform.
## 15          Thank
## 16            for
## 17            you
## 18          that.
## 19           This
## 20           song
Visualizing the parse tree
parse_tree <- obj[[113]]$doc[[1]][[1]]$parse
tree <- parse2tree(parse_tree)
SetNodeStyle(tree, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow")
plot(tree)

Exploratory analysis: Sentiment

Sentence sentiment values

Sentiment scores run from 0 (lowest) to 4 (highest)
+ 0,1 : very negative, negative
+ 2 : neutral
+ 3,4 : positive, very positive
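
The scale above can be written as a small lookup from sentimentValue to its label. The sketch below uses the label spellings that appear in this dataset's output (for example "Verypositive"). The helper name label_for is hypothetical.

```r
# Labels indexed by score 0..4, spelled as they appear in the CoreNLP output of this report
sentiment_labels <- c("Verynegative", "Negative", "Neutral", "Positive", "Verypositive")

# sentimentValue may arrive as character, so convert before indexing (R vectors are 1-based)
label_for <- function(value) sentiment_labels[as.integer(value) + 1L]

labels <- label_for(c("1", "3", "2"))
print(labels)
```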

sentiment = coreNLP_sentiment_parser(obj)
head(sentiment,20)
##              status_id
## 1  1383119211250262019
## 2  1383118031488966657
## 3  1383117882360426497
## 4  1382921685142097920
## 5  1382913839730085888
## 6  1383117571902300160
## 7  1383116765509603328
## 8  1383116563214176262
## 9  1382958366163615745
## 10 1382961049490620416
## 11 1383116269361192961
## 12 1383115368017825793
## 13 1382348436410880006
## 14 1382679091573354496
## 15 1383114935601876996
## 16 1383114484122845189
## 17 1383114323740954627
## 18 1382449908808962050
## 19 1383113274347118593
## 20 1382938068571807746
##                                                                                                                                                                                                                                                                                 text
## 1                                                                                                                 Wow that newsong hits deep. We need more voices in the social justice platform. Thank for you that. This song should be the opening song to the Olympics in Tokyo.
## 2                                                                             The head of the Tokyo Olympics on Friday was again forced to assure the world that the postponed games will open in just over three months and not be canceled despite surging COVID19 cases in Japan.
## 3                                                                                                                                                                                           Japanese media protest against Tokyo organizers' controls on news reports about Olympics
## 4                                                                                                                                                                                           Japanese media protest against Tokyo organizers' controls on news reports about Olympics
## 5                                                                                                                                                                                Shii Govt should cancel Tokyo Summer Olympics due to rampage of COVID19 variants at home and abroad
## 6                                                                                                                                                                                                  If U.S. boycotts Olympics they'll push their allies i.e Japan too to do the same.
## 7                                                                                                                                                                                                        Japan urged to outlaw LGBTQ discrimination before Olympics  The Japan Times
## 8                                                                                                                                                                                      Olympics must be 'reconsidered' due to Japan's failure to contain the pandemic report says Va
## 9                                                                                                                                     Olympic organizers are again being forced to assure the world that the postponed games will open in just 100 days despite surgingcases in . Va
## 10                                                                                                                                                                                             Olympics must be 'reconsidered' due to Japan's failure to contain pandemic  report Va
## 11                                                                                                                                                                                                                                                    Let's hope thecan still go on!
## 12                                                                                                                                                                        Tokyo Olympics must be 'reconsidered' due to Japan's failure to contain pandemic  report  2021416  Reuters
## 13                                                                                                                                             100 days still sounds so far away but three to four months sounds sooner. It's just so crazy how you put it she said while chuckling.
## 14                                                         Two officials in Japans ruling LDP party on Thursday said changes could be coming to the Tokyo Olympics. One suggested they still could be canceled and the other said even if they proceed it might be without any fans.
## 15                                                                            The head of the Tokyo Olympics on Friday was again forced to assure the world that the postponed games will open in just over three months and not be canceled despite surging COVID19 cases in Japan.
## 16 Instead of focusing all their energy on Olympics they should pay attention to thousands of students whove put their lives on hold because of Japanese government! No response no expected date Japan remains deaf to our problems and situation. We cannot enter since last year.
## 17                                                                       The modern Olympics are now synonymous with scandal including doping bribery physical abuse of athletes and have sparked suffering among the poor and working class class. Its Time to Rethink the Olympics
## 18                                                                                                                                                                                                            Get fucked!is ruining Japan and ruining lives. Cancel this fiasco now.
## 19                                                                                                                                                                                                                                Head of Tokyoagain says games will not be canceled
## 20                                                                                                                                                                                                                  Japan to widen coronavirus curbs casting fresh doubt on Olympics
##    sentiment sentimentValue
## 1   Positive              3
## 2   Negative              1
## 3   Negative              1
## 4   Negative              1
## 5   Positive              3
## 6    Neutral              2
## 7    Neutral              2
## 8   Negative              1
## 9   Negative              1
## 10  Negative              1
## 11   Neutral              2
## 12  Negative              1
## 13  Negative              1
## 14  Negative              1
## 15  Negative              1
## 16  Negative              1
## 17  Negative              1
## 18   Neutral              2
## 19   Neutral              2
## 20   Neutral              2
Sentiment categories in the dataset
unique(sentiment$sentiment)
## [1] "Positive"     "Negative"     "Neutral"      "Verypositive" "Verynegative"
sentiment$sentimentValue = sentiment$sentimentValue %>% as.numeric
# Distribution of tweets across sentiment categories
sentiment$sentiment %>% table()
## .
##     Negative      Neutral     Positive Verynegative Verypositive 
##          677         1050          260            2            1
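
Raw counts are easier to compare as shares. The sketch below takes the counts printed in the table above and converts them to percentages in base R.

```r
# Counts copied from the table above
counts <- c(Negative = 677, Neutral = 1050, Positive = 260,
            Verynegative = 2, Verypositive = 1)

total  <- sum(counts)
shares <- round(100 * counts / total, 1)   # percentage of tweets in each category

print(total)
print(shares)
```

Neutral is the largest category, and negative tweets outnumber positive ones.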
Trend of the average sentiment score over time
df$date = as.Date(df$created_at)

sentiment %>% 
  merge(df[,c("status_id","source","date")]) %>%
  group_by(date) %>% 
  summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>% 
  ggplot(aes(x=date,y=avg_sentiment)) + 
  geom_line()

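  • As the COVID-19 situation in Japan failed to improve, public sentiment grew steadily more negative.
  • On 4/15 the ruling party acknowledged that the Tokyo Olympics might be cancelled, and sentiment dropped further toward negative.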
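
The daily averages behind the trend line come from a group-by-date mean of sentimentValue. The sketch below shows the same aggregation in base R on made-up rows. With the real data, the input would be the merged sentiment and df table used above.

```r
# Made-up rows standing in for the merged sentiment table
toy <- data.frame(
  date = as.Date(c("2021-04-14", "2021-04-14", "2021-04-15")),
  sentimentValue = c(1, 3, 1)
)

# Mean sentiment per day (base-R equivalent of group_by(date) followed by summarise(mean))
daily <- aggregate(sentimentValue ~ date, data = toy, FUN = mean)
print(daily)
```

With only a few days in the window, each point on the line is a single daily mean, so it helps to read the trend together with the number of tweets per day.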
Sentiment trend over time by client
sentiment %>% 
  merge(df[,c("status_id","source","date")]) %>%
  filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>% 
  group_by(date,source) %>% 
  summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>% 
  ggplot(aes(x=date,y=avg_sentiment,color=source)) + 
  geom_line()
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

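  • Broken down by client, the same pattern holds: as the COVID-19 situation in Japan failed to improve, public sentiment grew steadily more negative.
Sentiment distribution, and the vocabulary used in positive and negative tweets
# Vocabulary used in positive tweets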
sentiment %>% 
  merge(tokens) %>% 
  anti_join(stop_words) %>% 
  filter(!lower_word %in% c('i','the')) %>% 
  filter(sentiment == "Verypositive" | sentiment =='Positive') %>%
  group_by(lower_lemma) %>% # group by lemma
  summarize(count = n()) %>% 
  filter(count >5 & count<400)%>%
  wordcloud2()
## Joining, by = "word"
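  • The word cloud shows that positive sentiment is weak, which we attribute to a year of living under the pandemic.
# Vocabulary used in negative tweets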
sentiment %>% 
  merge(tokens) %>% 
  anti_join(stop_words) %>% 
  filter(!lower_word %in% c('i','the')) %>% 
  filter(sentiment == "Verynegative" | sentiment =='Negative') %>%
  group_by(lower_lemma) %>% 
  summarize(count = n()) %>% 
  filter(count >10 &count<400)%>%
  wordcloud2()

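  • With fewer than 100 days until the opening of the Tokyo Olympics, the Japanese government was working hard to contain COVID-19, yet confirmed cases kept rising.
  • Polls showed that most people in Japan opposed holding the Olympics during the pandemic, and a "Cancel the Olympics" (Canceling Olympics) campaign was even started on Japanese Twitter.

Exploratory analysis: Sentiment fluctuation

Sentiment fluctuation by date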
tweets$date = format(tweets$created_at,'%Y%m%d')

(out  = tweets  %>%  with(
    sentiment_by( #document level
        get_sentences(text), 
        list( date)
    )
))
plot(out)

Sentiment fluctuation by date for different clients
(out  = tweets %>% filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>%  with(
    sentiment_by(
        get_sentences(text), 
        list(source, date)
    )
))
plot(out)

Converting emoji codes to descriptive text
replace_emoji("\U0001f4aa")
## [1] " flexed biceps "

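E. Conclusion

coreNLP

  1. Identify the key people, organizations, and countries in the topic
  2. Use syntactic analysis to find the dependency relations within sentences
  3. Find the most common words in positive and negative tweets separately

sentimentr

  1. Find the positive and negative words in the tweets, and determine which sentences in each text are positive or negative
  2. Track sentiment fluctuation by date and across clients

Reflections

  1. Using the course materials from the instructor and teaching assistants, we configured the Twitter API and collected data for analysis.
  2. We gathered tweets about the Tokyo Olympics and used coreNLP and sentimentr to follow current events.
  3. We thank the teaching assistants for their guidance during the project, which allowed our team to complete it smoothly.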