Introduction

  • Overview: Our group uses tweets posted between 4/6 and 4/10 to analyze public opinion and sentiment toward the COVID-19 vaccine news that occurred during this period.
  • Data source: Twitter, 4/6~4/10, 4,911 tweets, English
  • Keywords: tweets containing "#Vaccine" and related to "COVID"

1. coreNLP

Install packages

packages = c("dplyr","ggplot2","rtweet","xml2","httr","jsonlite","data.tree","NLP","igraph","sentimentr","tidytext","wordcloud2","DiagrammeR","scales","stringr","magrittr")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
library(wordcloud2)
library(ggplot2)
library(scales)
library(rtweet)
library(dplyr)
library(xml2)
library(httr)
library(jsonlite)
library(magrittr)
library(data.tree)
library(tidytext)
library(stringr)
library(DiagrammeR)
load("coreNLP_all2.RData")

1.1 Data collection: tweets

(1). Twitter API setup: fetch tweets via rtweet

#app = '2021_sma'
#consumer_key = '<YOUR_CONSUMER_KEY>'
#consumer_secret = '<YOUR_CONSUMER_SECRET>'
#access_token = '<YOUR_ACCESS_TOKEN>'
#access_secret = '<YOUR_ACCESS_SECRET>'
#twitter_token <- create_token(app,consumer_key, consumer_secret,
                    #access_token, access_secret,set_renv = FALSE)
#Consumer Keys: identify who you are
#Authentication Tokens: the authorization granted to you

(2). Set keywords and fetch tweets

# search keywords
#key = c("#Vaccine")
#context = "COVID"
#q = paste(c(key,context),collapse=" AND ")  

#fetch 5,000 tweets, excluding retweets
#tweets = search_tweets(q, lang="en", n=5000, include_rts = FALSE, token = twitter_token)

(3). Cleaning the tweet text

## function used for text cleaning
clean = function(txt) {
  txt = iconv(txt, "latin1", "ASCII", sub="") #convert the encoding; drop non-ASCII characters
  txt = gsub("(@|#)\\w+", "", txt) #remove @ or # followed by digits/letters/underscores (mentions and hashtags)
  txt = gsub("(http|https)://.*", "", txt) #remove URLs (.: any character, *: zero or more times)
  txt = gsub("[ \t]{2,}", "", txt) #remove runs of two or more spaces or tabs
  txt = gsub("\\n"," ",txt) #replace line breaks with spaces
  txt = gsub("\\s+"," ",txt) #collapse one or more whitespace characters into a single space (+: one or more)
  txt = gsub("^\\s+|\\s+$","",txt) #trim leading/trailing whitespace
  txt = gsub("&.*;","",txt) #remove HTML character entities
  txt = gsub("[^a-zA-Z0-9?!. ']","",txt) #keep only letters, digits, spaces and ?!.' (drops emoji and other symbols)
  txt }
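
A quick check on a made-up tweet (the string below is hypothetical, purely to illustrate what clean() strips out):

clean("Check this out @user #Vaccine https://t.co/xyz")
# should return roughly: "Check this out"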


tweets$text = clean(tweets$text)  #apply the cleaning function to the text column

df = data.frame()
  
df = rbind(df,tweets)  # convert to a data frame

df = df[!duplicated(df[,"status_id"]),]  #remove duplicate tweets
head(df)
## # A tibble: 6 x 90
##   user_id   status_id   created_at          screen_name  text            source 
##   <chr>     <chr>       <dttm>              <chr>        <chr>           <chr>  
## 1 613478079 1380880201~ 2021-04-10 13:47:50 AaryaNeha    My sense is th~ Twitte~
## 2 3575641   1380880055~ 2021-04-10 13:47:15 Heyes        first dose.     Instag~
## 3 8414992   1380879673~ 2021-04-10 13:45:44 waltomatic   Got the second~ Twitte~
## 4 12433342~ 1380879486~ 2021-04-10 13:44:59 EDC_inLasVe~ wont requirepa~ Twitte~
## 5 133822636 1380879363~ 2021-04-10 13:44:30 vishnuvy     46 of my resea~ Twitte~
## 6 11855317~ 1380878581~ 2021-04-10 13:41:24 The12fthOfN~ The largest Co~ Twitte~
## # ... with 84 more variables: display_text_width <dbl>,
## #   reply_to_status_id <chr>, reply_to_user_id <chr>,
## #   reply_to_screen_name <chr>, is_quote <lgl>, is_retweet <lgl>,
## #   favorite_count <int>, retweet_count <int>, quote_count <int>,
## #   reply_count <int>, hashtags <list>, symbols <list>, urls_url <list>,
## #   urls_t.co <list>, urls_expanded_url <list>, media_url <list>,
## #   media_t.co <list>, media_expanded_url <list>, media_type <list>,
## #   ext_media_url <list>, ext_media_t.co <list>, ext_media_expanded_url <list>,
## #   ext_media_type <chr>, mentions_user_id <list>, mentions_screen_name <list>,
## #   lang <chr>, quoted_status_id <chr>, quoted_text <chr>,
## #   quoted_created_at <dttm>, quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

df has 90 columns in total, but we will only use a few of them here (a small selection sketch follows the list):

  • user_id: user id
  • status_id: tweet id
  • created_at: time the tweet was posted
  • text: tweet content
  • source: client from which the tweet was posted
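
A minimal sketch of keeping only these columns (df_small is just an illustrative name):

df_small = df %>% select(user_id, status_id, created_at, text, source)
head(df_small)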

Check how many rows the data has and its time range

created_at is already a date/time column, so we can use min and max directly to find the earliest and latest dates.
Note: rtweet can only fetch tweets from at most about the past 10 days.

nrow(df)
## [1] 4911
min(df$created_at)
## [1] "2021-04-06 16:40:02 UTC"
max(df$created_at)
## [1] "2021-04-10 13:47:50 UTC"

1.2 Connecting to the CoreNLP API

(1). API call setup

Server side:
  • Start the CoreNLP server in a terminal first.
  • Open a terminal in the CoreNLP directory and run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
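
Once the server is running, a quick reachability check can help before sending thousands of tweets; a minimal sketch (assuming the server runs locally on port 9000, and using httr, which is loaded above):

status_code(GET("http://127.0.0.1:9000/"))  # expect 200 once the server is up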

# build the coreNLP API url: turn the local host address into a url that matches the coreNLP service
generate_API_url <- function(host, port="9000",
                    tokenize.whitespace="false", annotators=""){ #tokenization is not based on whitespace
    url <- sprintf('http://%s:%s/?properties={"tokenize.whitespace":"%s","annotators":"%s"}', host, port, tokenize.whitespace, annotators)
    url <- URLencode(url)
}
#specify the service location
host = "127.0.0.1"

generate_API_url(host)
# call the coreNLP api
call_coreNLP <- function(server_host, text, host="localhost", language="eng",
                    tokenize.whitespace="true", ssplit.eolonly="true", annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")){
  # assume there are two core-nlp servers: one for English (port 9000) and one for Chinese (port 9001)
  port <- ifelse(language=="eng", 9000, 9001);
  # build the api url
  url <- generate_API_url(server_host, port=port,
                    tokenize.whitespace=tokenize.whitespace, annotators=paste0(annotators, collapse = ','))
  
  result <- POST(url, body = text, encode = "json")
  doc <- httr::content(result, "parsed","application/json",encoding = "UTF-8")
  return (doc)
}
#run every document through the coreNLP service
coreNLP <- function(data,host){
  # send each document to core-nlp in turn; each document's result comes back as json
  # store the processed results as R objects
  result <- apply(data, 1 , function(x){
    object <- call_coreNLP(host, x['text'])
    list(doc=object, data=x)
  })
  
  return(result)
}

(2). Data-wrangling functions

Extract the tokenization results from the returned objects and output them in tidy data format

coreNLP_tokens_parser <- function(coreNLP_objects){
  
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # sentences of the document (only the first sentence is used)
    sentences <- doc$sentences
   
    sen <- sentences[[1]]
    
    tokens <- do.call(rbind, lapply(sen$tokens, function(x){
      result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
      result
    }))
    
    tokens <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(tokens))) %>% 
      bind_cols(tokens)
    
    tokens
  }))
  return(result)
}

Extract the word dependency relations from the returned core-nlp objects and output them in tidy data format

coreNLP_dependency_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # sentences of the document (only the first sentence is used)
    sentences <- doc$sentences
    sen <- sentences[[1]]
    dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
      result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
      result
    }))
  
    dependencies <- original_data %>%
      t() %>% 
      data.frame() %>% 
      select(-text) %>% 
      slice(rep(1:n(), each = nrow(dependencies))) %>% 
      bind_cols(dependencies)
    dependencies
  }))
  return(result)
}

Extract the sentence-level sentiment from the returned core-nlp objects and output it in tidy data format

coreNLP_sentiment_parser <- function(coreNLP_objects){
  result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
    original_data <- obj$data
    doc <- obj$doc
    # sentences of the document (only the first sentence is used)
    sentences <- doc$sentences
    sen <- sentences[[1]]
    
    sentiment <- original_data %>%
      t() %>% 
      data.frame() %>% 
      bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
  
    sentiment
  }))
  return(result)
}

Plotting the Dependency tree

Code reference: https://stackoverflow.com/questions/35496560/how-to-convert-corenlp-generated-parse-tree-into-data-tree-r-package

# display the dependency result as a graph
parse2tree <- function(ptext) {
  stopifnot(require(NLP) && require(igraph))
  
  # this step modifies coreNLP parse tree to mimic openNLP parse tree
  ptext <- gsub("[\r\n]", "", ptext)
  ptext <- gsub("ROOT", "TOP", ptext)


  ## Replace words with unique versions
  ms <- gregexpr("[^() ]+", ptext)                                      # just ignoring spaces and brackets?
  words <- regmatches(ptext, ms)[[1]]                                   # just words
  regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words))))  # add id to words
  
  ## Going to construct an edgelist and pass that to igraph
  ## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
  edgelist <- matrix('', nrow=length(words)-2, ncol=2)
  
  ## Function to fill in edgelist in place
  edgemaker <- (function() {
    i <- 0                                       # row counter
    g <- function(node) {                        # the recursive function
      if (inherits(node, "Tree")) {            # only recurse subtrees
        if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
          for (child in node$children) {
            childval <- if(inherits(child, "Tree")) child$value else child
            i <<- i+1
            edgelist[i,1:2] <<- c(val, childval)
          }
        }
        invisible(lapply(node$children, g))
      }
    }
  })()
  
  ## Create the edgelist from the parse tree
  edgemaker(Tree_parse(ptext))
  tree <- FromDataFrameNetwork(as.data.frame(edgelist))
  return (tree)
}

Sending the sentences to the service

Get the objects returned by coreNLP.
Do not run this chunk for now; it takes about half an hour (and may crash if you only have 4 GB of RAM).

gc() #free unused memory

t0 = Sys.time()
obj = df[,c(2,5)]  %>% filter(text != "") %>% coreNLP(host) #run through the local server (columns 2 and 5 are status_id and text)
#the object passed to coreNLP must be a data.frame with a text column

Sys.time() - t0 #elapsed time
#Time difference of 28 mins

save.image("coreNLP.RData")
#save what we will need so it can be loaded later straight from the RData file
#tokens =  coreNLP_tokens_parser(obj)
#dependencies = coreNLP_dependency_parser(obj)
#sentiment = coreNLP_sentiment_parser(obj)
#save.image("coreNLP_all.RData")

1.3 Extracting the results

(1). Tokenization, lemmatization, POS tagging, NER

tokens =  coreNLP_tokens_parser(obj)
head(tokens,20)
##              status_id     word    lemma  pos  ner
## 1  1380880201240866821       My       my PRP$    O
## 2  1380880201240866821    sense    sense   NN    O
## 3  1380880201240866821       is       be  VBZ    O
## 4  1380880201240866821   theres   theres   RB    O
## 5  1380880201240866821 shortage shortage   NN    O
## 6  1380880201240866821    ofand    ofand   NN    O
## 7  1380880201240866821    hence    hence   RB    O
## 8  1380880201240866821     they     they  PRP    O
## 9  1380880201240866821      are       be  VBP    O
## 10 1380880201240866821  showing     show  VBG    O
## 11 1380880201240866821   higher   higher  JJR    O
## 12 1380880201240866821   number   number   NN    O
## 13 1380880201240866821   closer   closer  RBR    O
## 14 1380880201240866821       to       to   IN    O
## 15 1380880201240866821   actual   actual   JJ    O
## 16 1380880201240866821 ofcases. ofcases.   NN    O
## 17 1380880201240866821       In       in   IN    O
## 18 1380880201240866821     2020     2020   CD DATE
## 19 1380880201240866821     they     they  PRP    O
## 20 1380880201240866821     were       be  VBD    O
  • coreNLP_tokens_parser columns:
    • status_id: matches status_id in the original df; the unique id of a tweet
    • word: the original token
    • lemma: the lemmatized form of the token
    • pos: part-of-speech tag
    • ner: named entity type

(2). Named entity recognition (NER)

  • Use NER to see which entity types were recognized
unique(tokens$ner)
##  [1] "O"                 "DATE"              "ORDINAL"          
##  [4] "NUMBER"            "ORGANIZATION"      "NATIONALITY"      
##  [7] "COUNTRY"           "CAUSE_OF_DEATH"    "DURATION"         
## [10] "LOCATION"          "PERSON"            "SET"              
## [13] "TITLE"             "STATE_OR_PROVINCE" "CITY"             
## [16] "MISC"              "TIME"              "MONEY"            
## [19] "CRIMINAL_CHARGE"   "IDEOLOGY"          "URL"              
## [22] "PERCENT"           "RELIGION"
#excluding the entity type O (Other), how many distinct words were tagged with an entity
length(unique(tokens$word[tokens$ner != "O"])) 
## [1] 2882

(3). Converting to lowercase

Letter case affects coreNLP's NER decisions, so the tweet text we fed in was not lowercased. After running the annotators, however, we create two new columns, lower_word and lower_lemma, holding the lowercased word and lemma so that word frequencies can be counted correctly. Lowercasing maps different casings of the same word to a single form before counting.

tokens$lower_word = tolower(tokens$word)
tokens$lower_lemma = tolower(tokens$lemma)

1.4 Exploratory analysis - NER

Countries involved (COUNTRY)

Through coreNLP's NER we can extract the countries (COUNTRY) mentioned in tweets discussing COVID-19 vaccine news during 4/6-4/10, to get an initial picture of the main countries involved in these topics.

tokens %>%
  filter(ner == "COUNTRY") %>%  #篩選NER為COUNTRY
  group_by(lower_word) %>% #根據word分組
  summarize(count = n()) %>% #計算每組
  top_n(n = 13, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col()+
  ggtitle("Word Frequency (NER is COUNTRY)") +
  theme(text=element_text(size=14))+
  coord_flip()

  • "India": Starting 4/6, India's second COVID-19 wave worsened rapidly; on 4/7 daily new cases exceeded 120,000, making India the second country after the "US" to surpass 100,000 new confirmed cases in a single day. Many people over 45 who had been taking a wait-and-see attitude toward vaccination rushed to register for appointments at vaccination centers, and the surge in demand triggered vaccine-shortage warnings in Maharashtra, the state hit hardest by the outbreak.
  • "US": The US is where several vaccines originated, and people have been receiving first and second doses since January, which we suspect is why it is discussed so much; people likely share their vaccination experiences on Twitter. In addition, on 4/9 the European Medicines Agency (EMA) received four reports of blood clots after the Johnson & Johnson vaccine, one of them fatal, and Johnson & Johnson is headquartered in the "US".
  • "UK": With the outbreak gradually easing, Prime Minister Boris Johnson said on 4/6 that, as originally planned, non-essential retail shops, hair salons, gyms, and outdoor dining venues would reopen from April 12.

Organizations involved (ORGANIZATION)

Through coreNLP's NER we can extract the organizations (ORGANIZATION) mentioned in tweets discussing COVID-19 vaccine news during 4/6-4/10, to get an initial picture of the main companies/institutions involved in these topics.

tokens %>%
  filter(ner == "ORGANIZATION") %>%  #篩選NER為ORGANIZATION
  group_by(lower_word) %>% #根據word分組
  summarize(count = n()) %>% #計算每組
  top_n(n = 10, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col()+
  ggtitle("Word Frequency (NER is ORGANIZATION)") +
  theme(text=element_text(size=14))+
  coord_flip()

  • "Publix": People could book vaccination appointments on the "Publix Pharmacy" website, and every "Publix" (a large supermarket chain) in Florida offered COVID-19 vaccines to people with appointments.
  • "Health": the World Health Organization.
  • "AstraZeneca": the pharmaceutical company. On 4/7 the European Medicines Agency (EMA) stated that injection of the AstraZeneca COVID-19 vaccine may be linked to very rare cases of blood clots combined with low platelet counts.
  • "pfizer": Pfizer. Because many topics about possible vaccine side effects arose during 4/6-4/10, we suspect the pharmaceutical companies above are the ones people compare on Twitter when weighing the pros and cons of each vaccine.

People involved (PERSON)

Through coreNLP's NER we can extract the people (PERSON) mentioned in tweets discussing COVID-19 vaccine news during 4/6-4/10, to get an initial picture of the main figures involved in these topics.

tokens %>%
  filter(ner == "PERSON") %>%  #篩選NER為PERSON
  group_by(lower_word) %>% #根據word分組
  summarize(count = n()) %>% #計算每組
  top_n(n = 10, count) %>%
  ungroup() %>% 
  mutate(word = reorder(lower_word, count)) %>%
  ggplot(aes(word, count)) + 
  geom_col()+
  ggtitle("Word Frequency (NER is PERSON)") +
  theme(text=element_text(size=14))+
  coord_flip()

  • johnson: UK Prime Minister Boris Johnson, or Johnson & Johnson
  • modi: India's Prime Minister, who on 4/8 ruled out a second nationwide lockdown, favoring curfews and a focus on micro-containment zones
  • biden: President of the United States

1.5 Exploratory analysis - Dependency

Sentence dependency results
dependencies = coreNLP_dependency_parser(obj)
head(dependencies,20)
##              status_id       dep governor governorGloss dependent
## 1  1380880201240866821      ROOT        0          ROOT        21
## 2  1380880201240866821 nmod:poss        2         sense         1
## 3  1380880201240866821     nsubj        6         ofand         2
## 4  1380880201240866821       cop        6         ofand         3
## 5  1380880201240866821    advmod        6         ofand         4
## 6  1380880201240866821  compound        6         ofand         5
## 7  1380880201240866821 parataxis       21  underplaying         6
## 8  1380880201240866821    advmod        6         ofand         7
## 9  1380880201240866821     nsubj       10       showing         8
## 10 1380880201240866821       aux       10       showing         9
## 11 1380880201240866821 parataxis       21  underplaying        10
## 12 1380880201240866821      amod       12        number        11
## 13 1380880201240866821       obj       10       showing        12
## 14 1380880201240866821    advmod       10       showing        13
## 15 1380880201240866821      case       16      ofcases.        14
## 16 1380880201240866821      amod       16      ofcases.        15
## 17 1380880201240866821       obl       13        closer        16
## 18 1380880201240866821      case       18          2020        17
## 19 1380880201240866821       obl       21  underplaying        18
## 20 1380880201240866821     nsubj       21  underplaying        19
##    dependentGloss
## 1    underplaying
## 2              My
## 3           sense
## 4              is
## 5          theres
## 6        shortage
## 7           ofand
## 8           hence
## 9            they
## 10            are
## 11        showing
## 12         higher
## 13         number
## 14         closer
## 15             to
## 16         actual
## 17       ofcases.
## 18             In
## 19           2020
## 20           they
Visualizing the Dependency tree
parse_tree <- obj[[113]]$doc[[1]][[1]]$parse
tree <- parse2tree(parse_tree)
SetNodeStyle(tree, style = "filled,rounded", shape = "box")
plot(tree)

1.6 Exploratory analysis - Sentiment

Sentence sentiment scores

Sentiment scores range from 0 (lowest) to 4 (highest):
  • 0, 1: very negative, negative
  • 2: neutral
  • 3, 4: positive, very positive

sentiment = coreNLP_sentiment_parser(obj)
head(sentiment,20)
##              status_id
## 1  1380880201240866821
## 2  1380880055425933313
## 3  1380879673429790721
## 4  1380879486422409217
## 5  1380879363323781131
## 6  1380878581803274242
## 7  1379829045169295361
## 8  1380571838518939652
## 9  1380662829313744896
## 10 1380662124058320900
## 11 1380866863194759170
## 12 1380878392245985284
## 13 1380748975054094342
## 14 1380749687385333761
## 15 1380770271703883776
## 16 1380878321395896325
## 17 1380878311069483019
## 18 1380792336200896512
## 19 1380878090038091778
## 20 1380742979783114752
##                                                                                                                                                                                                                                                text
## 1                                                   My sense is theres shortage ofand hence they are showing higher number closer to actual ofcases. In 2020 they were underplaying the covid scene. Covid is the new business opportunity for many
## 2                                                                                                                                                                                                                                       first dose.
## 3                                                                              Got the second COVID shot. I hydrated and took a mega dose of vitamins to stave off symptoms. It mostly worked. Is the erection supposed to last more than 10 hours?
## 4                                                      wont requirepassport because it violates peoples rights and freedoms and may result in discrimination yet we deny students access to school everyday for not showing proof of their vaccines
## 5                          46 of my research team are COVID positive. All had received one dose of vaccine. Two had both doses  one is positive and other negative. All have mild symptomsasymptomatic. Take vaccine  will save u from Severe COVID
## 6                                                     The largest Covid19 vaccination centre in thewas put through its paces on Friday but it will not be fully operational until May as the country badly hit by the pandemic waits for moreshots.
## 7                        The World Health Organizations advisorysafety panel said on Wednesday a causal link between the AstraZeneca Covid19 vaccine and rare cases of blood clots with low platelets is considered plausible but is not confirmed.
## 8                                                                                  Pfizer Inc and German partner BioNtech SE on Friday said they have requested US regulatory agencies to expand the emergency use of their Covid19toaged 12 to 15.
## 9                          It is fiendishly challenging to prove that an adverse event following immunization was caused by theitself. The Nature Coronapodexplores why it is so hard to investigate the rare side effects of COVID vaccines. 13min
## 10             On Wednesday the European Medicines Agency concluded that the OxfordAstraZeneca COVID19is possibly linked to very rare blood clots accompanied by low levels of blood platelets. The finding leaveswrestling with a medical mystery.
## 11                                                                      TheNational Biotec Group Company CNBG has obtained regulatory approval to move a third Covid19candidate into the human testing stage CNBG said on Saturday. Reuters reports
## 12                                                                                                      and 'scauses' thrombosis with thrombocytopenia syndrome TTS a severe condition with blood clots in the brain or the other parts of the body
## 13                                                                                                                                   prefersoverand 's ATAGI has noted further evidence of rare but severe thrombosis with the COVID19 vaccine of .
## 14                                                                                                                                      prefersover 's ATAGI has noted further evidence of rare but severe thrombosis with the COVID19 vaccine of .
## 15                                                                                                                                                           In  ATAGI has recommended the use of 'svaccine overand 'sin adults aged under 50 years
## 16 I was scared to go to a massive stadium to get my first shot especially after a year of avoiding crowds and with the Covid19 surge in Michigan.It actually turned out to be a very positive experience and Im thankful for everyone working at !
## 17                                                                                                                                                                                                        How I feel after getting my first dose of
## 18                                                                                                                                                                              Possible COVID19appointments detected at CVS in ! Try scheduling at
## 19                                                                                                                                                                                     COVID19appointments available both at CVS and Walgreens now!
## 20                                                                                                                                                                        Possible COVID19appointments detected at CVS in Austin! Try scheduling at
##    sentiment sentimentValue
## 1   Positive              3
## 2    Neutral              2
## 3   Negative              1
## 4   Negative              1
## 5   Positive              3
## 6   Negative              1
## 7   Positive              3
## 8    Neutral              2
## 9   Positive              3
## 10   Neutral              2
## 11   Neutral              2
## 12  Negative              1
## 13   Neutral              2
## 14   Neutral              2
## 15   Neutral              2
## 16  Negative              1
## 17   Neutral              2
## 18   Neutral              2
## 19   Neutral              2
## 20   Neutral              2
Sentiment categories present in the dataset
unique(sentiment$sentiment)
## [1] "Positive"     "Neutral"      "Negative"     "Verynegative" "Verypositive"
sentiment$sentimentValue = sentiment$sentimentValue %>% as.numeric
#distribution of tweets across sentiment categories
sentiment$sentiment %>% table()
## .
##     Negative      Neutral     Positive Verynegative Verypositive 
##         1228         2890          762            3            2
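
If you prefer the three broad buckets described above (0-1 negative, 2 neutral, 3-4 positive) to the five labels, a minimal sketch (the grouping below is our own convention, not part of coreNLP):

sentiment %>%
  mutate(sentiment_group = case_when(
    sentimentValue <= 1 ~ "negative",   # 0 or 1
    sentimentValue == 2 ~ "neutral",
    TRUE                ~ "positive"    # 3 or 4
  )) %>%
  count(sentiment_group)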
Average sentiment score over time
df$date = as.Date(df$created_at)

sentiment %>% 
  merge(df[,c("status_id","source","date")]) %>%
  group_by(date) %>% 
  summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>% 
  ggplot(aes(x=date,y=avg_sentiment)) + 
  geom_line()


On 4/9, South Korea and Iran reported record single-day new case counts, and Poland hit a record single-day COVID death toll. On the same day, EU and UK regulators found that the AZ vaccine may be linked to rare brain blood clots; the EU said clots should be listed as a rare side effect but that the benefits of the AZ vaccine still outweigh the risks, and on 4/7 the UK, Italy, Spain, Belgium, and other countries issued recommendations or restrictions on vaccination age. We suspect that because more negative COVID-19 vaccine news appeared on 4/9, sentiment that day was lower than on other days, showing a downward trend.

Sentiment over time by client
sentiment %>% 
  merge(df[,c("status_id","source","date")]) %>%
  filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>% 
  group_by(date,source) %>% 
  summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>% 
  ggplot(aes(x=date,y=avg_sentiment,color=source)) + 
  geom_line()
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.

Examine the sentiment distribution: what vocabulary do positive and negative tweets use?
#vocabulary used in positive tweets
sentiment %>% 
  merge(tokens) %>% 
  anti_join(stop_words) %>% 
  filter(!lower_word %in% c('i','the')) %>% 
  filter(sentiment == "Verypositive" | sentiment =='Positive') %>%
  group_by(lower_lemma) %>% #group by lemma
  summarize(count = n()) %>% 
  filter(count >30)%>%
  wordcloud2()
## Joining, by = "word"
  • Words such as "health", "safe", and "expert" appear
#vocabulary used in negative tweets
sentiment %>% 
  merge(tokens) %>% 
  anti_join(stop_words) %>% 
  filter(!lower_word %in% c('i','the')) %>% 
  filter(sentiment == "Verynegative" | sentiment =='Negative') %>%
  group_by(lower_lemma) %>% 
  summarize(count = n()) %>% 
  filter(count >30)%>%
  wordcloud2()


  • Words such as "death", "blood", and "hillsborough" appear


Here, "hillsborough" refers to Hillsborough: on 4/8 North Carolina public health officials said that, after at least 26 people experienced adverse reactions such as fainting following the Johnson & Johnson vaccine, the large COVID-19 vaccination site in Raleigh, North Carolina and the clinics in Hillsborough and Chapel Hill suspended use of that vaccine.

2. Sentimentr: English sentiment analysis

library(sentimentr)
## Warning: package 'sentimentr' was built under R version 4.0.5
mytext <- get_sentences(tweets$text) #convert text into a get_sentences object
#the character vector becomes a list whose elements are character vectors (split into sentences)
Sentiment score for each document

Sentiment scores lie roughly between -1 and 1: scores below 0 are negative, above 0 positive, and 0 is neutral.

sentiment_by(mytext) #document level
##       element_id word_count         sd ave_sentiment
##    1:          1         31 0.29389508    0.13137237
##    2:          2          2         NA    0.00000000
##    3:          3         30 0.14124369   -0.03796108
##    4:          4         29         NA    0.13927150
##    5:          5         36 0.13099047    0.16971437
##   ---                                               
## 4907:       4907         38 0.29581381   -0.27037167
## 4908:       4908         38 0.38248440    0.23511543
## 4909:       4909         36 0.65423148    0.28797082
## 4910:       4910          4         NA    0.00000000
## 4911:       4911         44 0.05462869    0.21540501
Sentiment score for each sentence
sentiment(mytext) #sentence level
##       element_id sentence_id word_count  sentiment
##    1:          1           1         16 -0.0875000
##    2:          1           2          7  0.0000000
##    3:          1           3          8  0.4596194
##    4:          2           1          2  0.0000000
##    5:          3           1          5 -0.1788854
##   ---                                             
## 8680:       4909           2         12  0.0000000
## 8681:       4909           3          2  1.0182338
## 8682:       4910           1          4  0.0000000
## 8683:       4911           1         36  0.2540333
## 8684:       4911           2          8  0.1767767
  • Returns a data frame with 4 columns (see the sketch after this list):
    • element_id – index of the document
    • sentence_id – index of the sentence within that document
    • word_count – number of words in the sentence
    • sentiment – sentiment score of the sentence
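
To tie the sentence-level scores back to the tweets they came from, a minimal sketch (assuming element_id indexes the elements of tweets$text, which is how get_sentences numbers the documents; sent_sentences is just an illustrative name):

sent_sentences <- sentiment(mytext)
sent_sentences$tweet_text <- tweets$text[sent_sentences$element_id]  #attach the (cleaned) tweet text
head(sent_sentences[, c("element_id", "sentence_id", "sentiment", "tweet_text")])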

2.2 Applying sentimentr to the Twitter data

Positive words in the tweets
set.seed(10)
x <- sample(tweets$text, 1000, replace = FALSE) #randomly sample 1,000 tweets without replacement
sentiment_words <- extract_sentiment_terms(x) #extract the sentiment-bearing words
sentiment_counts <- attributes(sentiment_words)$counts #count occurrences
sentiment_counts[polarity > 0,]   #positive words
##            words polarity  n
##   1:      please      1.0 10
##   2:        care      1.0  7
##   3:    benefits      1.0  6
##   4: appropriate      1.0  5
##   5:      equity      1.0  4
##  ---                        
## 442:   checklist      0.1  1
## 443:      prefer      0.1  1
## 444:     reading      0.1  1
## 445:    building      0.1  1
## 446:         sex      0.1  1
Negative words in the tweets
sentiment_counts[polarity < 0,] %>% arrange(desc(n)) %>% top_n(10) #most frequent negative words
## Selecting by n
##          words polarity  n
##  1:       shot    -0.40 32
##  2:       risk    -0.75 32
##  3: government    -0.50 18
##  4:      virus    -0.50 16
##  5:      limit    -0.25 15
##  6:   pandemic    -1.00 15
##  7:       stop    -0.40 14
##  8:      death    -0.75 11
##  9:       clot    -0.40 10
## 10:    adverse    -0.50 10
## 11:   shortage    -0.75 10
Highlight each sentence as positive or negative
set.seed(12)
df%>%
    filter(status_id %in% sample(unique(status_id), 30)) %>% #randomly pick 30 tweets
    mutate(review = get_sentences(text)) %$% 
    sentiment_by(review, status_id) %>%
    highlight()
## Saved in C:\Users\ooolivia\AppData\Local\Temp\RtmpCUdA2m/polarity.html
## Opening C:\Users\ooolivia\AppData\Local\Temp\RtmpCUdA2m/polarity.html ...

2.3 Sentiment fluctuation by date

Code adapted from https://github.com/trinker/sentimentr

tweets$date = format(tweets$created_at,'%Y%m%d')

(out  = tweets  %>%  with(
    sentiment_by( #document level
        get_sentences(text), 
        list( date)
    )
))
plot(out)

  • Tweets posted during 4/7-4/9 carry stronger sentiment.
  • By 4/10, tweet sentiment gradually drifts from slightly positive toward slightly negative.

2.4 Sentiment fluctuation by date for different clients

(out  = tweets %>% filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>%  with(
    sentiment_by(
        get_sentences(text), 
        list(source, date)
    )
))
plot(out)

  • During 4/6-4/10, Android users were generally more negative than iPhone users.

Group summary

During 4/6-4/10 a lot of news unfavorable to vaccines appeared, so tweets comparing the effectiveness, advantages, and drawbacks of different vaccine brands showed up on Twitter. As the pandemic worsened abroad, people became more willing to get vaccinated, and India reported vaccine shortages. Our group feels that public sentiment on this topic leaned negative overall: people hope vaccines will provide protection, but during our observation window most of the news concerned vaccine side effects, which inevitably raises worries about the risks of vaccination, leaving the sense that the arrival of vaccines brings hope but also some disappointment.

TA summary

coreNLP

  1. Identify the key people, organizations, and countries of the topic
  2. Use syntactic analysis to find dependency relations within sentences
  3. Find the common words in positive and negative tweets, respectively

sentimentr

  1. Find the positive and negative words in the tweets, and identify which sentences in each document are positive or negative
  2. Track sentiment fluctuation by date and across different clients