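Introduction

  • Purpose: use coreNLP and sentimentr to analyze Twitter text about the container ship that ran aground in the Suez Canal
  • Overview: the giant container ship Ever Given ran aground in Egypt's Suez Canal on March 23, 2021, wedging itself across the channel and blocking all traffic. After a joint salvage effort it was freed from the bank on March 29. We use tweets to analyze public sentiment toward the "Suez crisis".
  • Data source: Twitter, 3/27 to 4/4, about 5,000 tweets, English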

1. coreNLP

Install packages

1.1 Data collection: tweets

(1). Twitter API setup: fetch tweets with rtweet

(2). Set the search keywords and fetch tweets
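The original code chunk for this step is not shown, so the following is only a minimal sketch of how the search could look. It assumes rtweet 0.7.x and valid developer credentials; the environment variable names, the app name, and the query string are hypothetical placeholders, not the values used for this dataset.

```r
# Sketch only: wraps the API call in a function so nothing runs until called.
# Assumes rtweet 0.7.x; credential variable names below are hypothetical.
fetch_suez_tweets <- function(query = "#SuezCanal OR #EverGiven", n = 5000) {
  token <- rtweet::create_token(
    app             = Sys.getenv("TWITTER_APP"),
    consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET"),
    set_renv        = FALSE  # do not write the token path to .Renviron
  )
  # lang = "en" restricts results to English tweets;
  # include_rts = FALSE (an assumption here) drops retweets.
  rtweet::search_tweets(
    q = query, n = n, include_rts = FALSE, lang = "en", token = token
  )
}
```

Calling `df <- fetch_suez_tweets()` would then return a data frame with one row per tweet, provided the credentials still have search access.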

(3). Clean the tweet text
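The exact cleaning rules used here are not shown. As a rough illustration, a cleaning helper could strip URLs, @mentions, #hashtags, and non-ASCII characters before the text is sent to CoreNLP. The function name `clean_tweet_text` and the specific rules are assumptions.

```r
# Hypothetical cleaning helper; the rules are illustrative, not the original ones.
clean_tweet_text <- function(text) {
  text <- gsub("https?://\\S+", "", text, perl = TRUE)  # drop URLs
  text <- gsub("[@#]\\w+", "", text, perl = TRUE)       # drop @mentions and #hashtags
  text <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII (emoji etc.)
  text <- gsub("\\s+", " ", text, perl = TRUE)          # collapse repeated whitespace
  trimws(text)
}

cleaned <- clean_tweet_text(
  "Ever Given is moving! https://t.co/abc #SuezCanal @user"
)
print(cleaned)
```

Note that case is deliberately left untouched at this stage, because CoreNLP's NER relies on capitalization.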

## # A tibble: 6 x 90
##   user_id status_id created_at          screen_name text  source
##   <chr>   <chr>     <dttm>              <chr>       <chr> <chr> 
## 1 441208… 13787921… 2021-04-04 19:30:31 BorjeMelin  Simo… Twitt…
## 2 441208… 13763228… 2021-03-28 23:58:39 BorjeMelin  Traf… Twitt…
## 3 441208… 13787731… 2021-04-04 18:15:10 BorjeMelin  To b… Twitt…
## 4 441208… 13783128… 2021-04-03 11:45:55 BorjeMelin  Upda… Twitt…
## 5 322805… 13787823… 2021-04-04 18:51:55 Deveshjais… SEUZ… Twitt…
## 6 322805… 13787823… 2021-04-04 18:51:41 Deveshjais… SEUZ… Twitt…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## #   reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## #   is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## #   quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## #   urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## #   media_url <list>, media_t.co <list>, media_expanded_url <list>,
## #   media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## #   ext_media_expanded_url <list>, ext_media_type <chr>,
## #   mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## #   quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## #   quoted_source <chr>, quoted_favorite_count <int>,
## #   quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## #   quoted_name <chr>, quoted_followers_count <int>,
## #   quoted_friends_count <int>, quoted_statuses_count <int>,
## #   quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## #   retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## #   retweet_source <chr>, retweet_favorite_count <int>,
## #   retweet_retweet_count <int>, retweet_user_id <chr>,
## #   retweet_screen_name <chr>, retweet_name <chr>,
## #   retweet_followers_count <int>, retweet_friends_count <int>,
## #   retweet_statuses_count <int>, retweet_location <chr>,
## #   retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## #   place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## #   country_code <chr>, geo_coords <list>, coords_coords <list>,
## #   bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## #   description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## #   friends_count <int>, listed_count <int>, statuses_count <int>,
## #   favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## #   profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## #   profile_banner_url <chr>, profile_background_url <chr>,
## #   profile_image_url <chr>

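df has 90 columns, but we only use a few of them here:

  • user_id: user ID
  • status_id: tweet ID
  • created_at: time the tweet was posted
  • text: tweet text
  • source: client the tweet was posted from

Check the number of tweets and their distribution over time

created_at is already a date-time (dttm) column, so min and max can be applied to it directly to get the earliest and latest timestamps.
Note: rtweet's search uses Twitter's standard search API, which only returns tweets from roughly the last 6 to 9 days.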
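The check described above can be sketched as follows. The toy `df` is a stand-in for the real tweet data frame, with made-up rows that only mimic the `created_at` column.

```r
# Toy stand-in for the tweet data frame (values are illustrative).
df <- data.frame(
  status_id  = c("1", "2", "3"),
  created_at = as.POSIXct(
    c("2021-03-27 18:35:51", "2021-04-01 12:00:00", "2021-04-04 19:30:31"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

n_tweets    <- nrow(df)            # number of tweets
first_tweet <- min(df$created_at)  # earliest timestamp
last_tweet  <- max(df$created_at)  # latest timestamp

print(n_tweets)
print(first_tweet)
print(last_tweet)
```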

## [1] 4958
## [1] "2021-03-27 18:35:51 UTC"
## [1] "2021-04-04 19:30:31 UTC"

1.2 Connecting to the CoreNLP API

(1). Setting up the API call

Server side:
  • The CoreNLP server must first be started from a terminal.
  • Open a terminal in the CoreNLP directory and run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
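On the client side, the request can be sketched as below, assuming the server started above is listening on port 9000 of the local machine. The helper names `build_corenlp_url` and `annotate_text` are hypothetical, and the annotator list is one plausible choice that covers tokens, NER, dependencies, and sentiment.

```r
# Build the server URL with the annotation properties encoded as JSON.
# Helper names and defaults are assumptions for illustration.
build_corenlp_url <- function(host = "127.0.0.1", port = 9000L,
                              annotators = "tokenize,ssplit,pos,lemma,ner,parse,sentiment") {
  properties <- sprintf('{"annotators":"%s","outputFormat":"json"}', annotators)
  sprintf("http://%s:%s/?properties=%s",
          host, port, utils::URLencode(properties, reserved = TRUE))
}

# POST one text to the server and return the parsed JSON as nested lists.
# Requires the httr and jsonlite packages and a running CoreNLP server.
annotate_text <- function(text, url = build_corenlp_url()) {
  response <- httr::POST(url, body = text)
  httr::stop_for_status(response)
  jsonlite::fromJSON(
    httr::content(response, as = "text", encoding = "UTF-8"),
    simplifyVector = FALSE
  )
}

print(build_corenlp_url())
```

Keeping the URL builder separate from the request makes it easy to change the annotator list without touching the HTTP code.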

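(2). Data-wrangling functions

Extract the tokenization results from the returned object and output them in tidy data format

Extract the dependency relations between words from the returned CoreNLP object and output them in tidy data format

Extract the sentence-level sentiment from the returned CoreNLP object and output it in tidy data format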
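The helpers themselves are not shown, so here is a minimal sketch of two of them, run on a hand-written mock of a parsed CoreNLP JSON response (a list with a `sentences` element, each sentence holding `tokens`, `sentiment`, and `sentimentValue`). The function names `tokens_to_df` and `sentiment_to_df` and the mock values are assumptions.

```r
# Hand-written mock of one parsed response; values are illustrative only.
mock_response <- list(sentences = list(
  list(
    sentiment = "Positive", sentimentValue = "3",
    tokens = list(
      list(word = "Ever",   lemma = "Ever",  pos = "NNP", ner = "O"),
      list(word = "Given",  lemma = "Given", pos = "NNP", ner = "O"),
      list(word = "is",     lemma = "be",    pos = "VBZ", ner = "O"),
      list(word = "moving", lemma = "move",  pos = "VBG", ner = "O")
    )
  )
))

# One row per token: status_id, word, lemma, pos, ner.
tokens_to_df <- function(response, status_id) {
  rows <- lapply(response$sentences, function(sentence) {
    do.call(rbind, lapply(sentence$tokens, function(token) {
      data.frame(status_id = status_id, word = token$word, lemma = token$lemma,
                 pos = token$pos, ner = token$ner, stringsAsFactors = FALSE)
    }))
  })
  do.call(rbind, rows)
}

# One row per sentence: status_id, sentiment label, numeric sentiment value.
sentiment_to_df <- function(response, status_id) {
  do.call(rbind, lapply(response$sentences, function(sentence) {
    data.frame(status_id = status_id,
               sentiment = sentence$sentiment,
               sentimentValue = as.integer(sentence$sentimentValue),
               stringsAsFactors = FALSE)
  }))
}

tokens_df    <- tokens_to_df(mock_response, "1376524772608196609")
sentiment_df <- sentiment_to_df(mock_response, "1376524772608196609")
print(tokens_df)
print(sentiment_df)
```

A dependency helper would follow the same pattern, reading `dep`, `governor`, `governorGloss`, `dependent`, and `dependentGloss` from each sentence's dependency list.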

1.3 Extracting results

(1). Tokenization, lemmatization, part-of-speech tagging, NER

##              status_id        word       lemma pos    ner
## 1  1378792114037870600       Simon       Simon NNP PERSON
## 2  1378792114037870600      Parkes      Parkes NNP PERSON
## 3  1378792114037870600      Update      Update NNP      O
## 4  1378792114037870600        With        with  IN      O
## 5  1378792114037870600       great       great  JJ      O
## 6  1378792114037870600      regret      regret  NN      O
## 7  1378792114037870600           I           I PRP      O
## 8  1378792114037870600        must        must  MD      O
## 9  1378792114037870600    announce    announce  VB      O
## 10 1378792114037870600        that        that  IN      O
## 11 1378792114037870600    children       child NNS      O
## 12 1378792114037870600        were          be VBD      O
## 13 1378792114037870600       being          be VBG      O
## 14 1378792114037870600 transported   transport VBN      O
## 15 1378792114037870600          in          in  IN      O
## 16 1378792114037870600       cargo       cargo  NN      O
## 17 1378792114037870600 containers. containers.  NN      O
## 18 1378792114037870600         The         the  DT      O
## 19 1378792114037870600   operation   operation  NN      O
## 20 1378792114037870600          in          in  IN      O
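  • Columns returned by coreNLP_tokens_parser:
    • status_id: matches status_id in the original df; the unique ID of a tweet
    • word: the original token
    • lemma: the lemmatized form of the token
    • pos: part of speech
    • ner: named-entity label

(2). Named-entity recognition (NER)

  • Inspect the NER labels to see which entity types were recognized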
##  [1] "PERSON"            "O"                 "CITY"             
##  [4] "LOCATION"          "NUMBER"            "ORDINAL"          
##  [7] "ORGANIZATION"      "TITLE"             "MISC"             
## [10] "DATE"              "DURATION"          "MONEY"            
## [13] "PERCENT"           "COUNTRY"           "TIME"             
## [16] "NATIONALITY"       "SET"               "STATE_OR_PROVINCE"
## [19] "CAUSE_OF_DEATH"    "CRIMINAL_CHARGE"   "IDEOLOGY"         
## [22] "RELIGION"
## [1] 1880

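(3). Convert to lowercase

Because letter case affects CoreNLP's NER decisions, the tweet text we passed in was not case-normalized. After running the annotators, we create two new columns, lower_word and lower_lemma, that hold the lowercased word and lemma. This way, variants of the same term that differ only in case (such as Evergiven and evergiven) are counted together when computing word frequencies.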
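A small sketch of this step on toy tokens; the rows are made up, and only the column names lower_word and lower_lemma come from the text above.

```r
# Toy token table (illustrative values).
tokens <- data.frame(
  word  = c("Evergiven", "evergiven", "Suez", "SUEZ", "canal"),
  lemma = c("Evergiven", "evergiven", "Suez", "SUEZ", "canal"),
  stringsAsFactors = FALSE
)

# Lowercase copies used only for counting; the original columns stay intact.
tokens$lower_word  <- tolower(tokens$word)
tokens$lower_lemma <- tolower(tokens$lemma)

# Word frequencies on the lowercased tokens, most frequent first.
word_freq <- sort(table(tokens$lower_word), decreasing = TRUE)
print(word_freq)
```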

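1.4 Exploratory analysis - NER

Countries involved (COUNTRY)

Using the NER output from coreNLP, we can extract the countries (COUNTRY) mentioned in tweets about the Ever Given grounding in the Suez Canal, which gives a first picture of the main countries tied to this topic.

  • The ship was stuck in the Suez Canal in "Egypt", holding up vessels from all over the world
  • The Ever Given was sailed by an "Indian" captain, chartered by a "Taiwanese" company from a "Japanese" owner, insured in the "United Kingdom", and flying the flag of "Panama"; it was carrying cargo from "China" and was bound for the "Netherlands"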
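Extracting the most frequent entities of a given type could look like the sketch below. The helper name `top_entities` is hypothetical and the token table is toy data; the same function would serve for ORGANIZATION and PERSON by changing the `entity_type` argument.

```r
# Toy token table with NER labels (illustrative values).
tokens <- data.frame(
  word = c("Egypt", "egypt", "Egypt", "Japan", "Japan", "India", "stuck"),
  ner  = c("COUNTRY", "COUNTRY", "COUNTRY", "COUNTRY", "COUNTRY", "COUNTRY", "O"),
  stringsAsFactors = FALSE
)

# Count lowercased tokens carrying the requested NER label, most frequent first.
top_entities <- function(tokens, entity_type, n = 10) {
  hits   <- tolower(tokens$word[tokens$ner == entity_type])
  counts <- sort(table(hits), decreasing = TRUE)
  head(as.data.frame(counts, stringsAsFactors = FALSE), n)
}

top_countries <- top_entities(tokens, "COUNTRY")
print(top_countries)
```

This counts single tokens, so multi-word entities such as "Suez Canal Authority" would appear as separate words unless adjacent tokens with the same label are merged first.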
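Organizations involved (ORGANIZATION)

Using the NER output from coreNLP, we can extract the organizations (ORGANIZATION) mentioned in tweets about the Ever Given grounding in the Suez Canal, which gives a first picture of the main companies and agencies tied to this topic.

  • Suez Canal Authority (SCA)
  • Inchcape, a shipping agency
  • Evergreen
People involved (PERSON)

Using the NER output from coreNLP, we can extract the people (PERSON) mentioned in tweets about the Ever Given grounding in the Suez Canal, which gives a first picture of the main figures tied to this topic.

  • sisi: the current president of Egypt
  • Ossama Rabei: chairman of the Suez Canal Authority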

1.5 Exploratory analysis - Dependency

Sentence dependency results
##              status_id        dep governor governorGloss dependent
## 1  1378792114037870600       ROOT        0          ROOT        26
## 2  1378792114037870600   compound        3        Update         1
## 3  1378792114037870600   compound        3        Update         2
## 4  1378792114037870600   obl:tmod       26       success         3
## 5  1378792114037870600       case        6        regret         4
## 6  1378792114037870600       amod        6        regret         5
## 7  1378792114037870600        obl        9      announce         6
## 8  1378792114037870600      nsubj        9      announce         7
## 9  1378792114037870600        aux        9      announce         8
## 10 1378792114037870600  acl:relcl        3        Update         9
## 11 1378792114037870600       mark       14   transported        10
## 12 1378792114037870600 nsubj:pass       14   transported        11
## 13 1378792114037870600        aux       14   transported        12
## 14 1378792114037870600   aux:pass       14   transported        13
## 15 1378792114037870600      ccomp        9      announce        14
## 16 1378792114037870600       case       17   containers.        15
## 17 1378792114037870600   compound       17   containers.        16
## 18 1378792114037870600        obl       14   transported        17
## 19 1378792114037870600        det       19     operation        18
## 20 1378792114037870600      nsubj       26       success        19
##    dependentGloss
## 1         success
## 2           Simon
## 3          Parkes
## 4          Update
## 5            With
## 6           great
## 7          regret
## 8               I
## 9            must
## 10       announce
## 11           that
## 12       children
## 13           were
## 14          being
## 15    transported
## 16             in
## 17          cargo
## 18    containers.
## 19            The
## 20      operation
Visualizing the dependency tree

1.6 Exploratory analysis - Sentiment

Sentence sentiment values

Sentiment scores range from 0 (lowest) to 4 (highest):
  • 0, 1: very negative, negative
  • 2: neutral
  • 3, 4: positive, very positive
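The mapping from numeric score to label can be written out as a small lookup. The label spellings follow the ones CoreNLP returns in this dataset (for example "Verypositive"); the function name `label_sentiment` and the sample values are hypothetical.

```r
# Labels ordered by score: index 1 corresponds to score 0, index 5 to score 4.
sentiment_levels <- c("Verynegative", "Negative", "Neutral", "Positive", "Verypositive")

# Map 0..4 scores to an ordered set of labels.
label_sentiment <- function(value) {
  factor(sentiment_levels[value + 1], levels = sentiment_levels)
}

values <- c(3, 2, 2, 1, 4, 0)  # illustrative scores
sentiment_table <- table(label_sentiment(values))
print(sentiment_table)
```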

##              status_id
## 1  1378792114037870600
## 2  1376322876274532359
## 3  1378773153304956930
## 4  1378312805724602368
## 5  1378782399681589249
## 6  1378782341825372160
## 7  1378780519807193093
## 8  1376524772608196609
## 9  1376616766047186951
## 10 1376256378986201088
## 11 1378772889982410766
## 12 1378772308693688321
## 13 1378405661877403649
## 14 1375905467482931202
## 15 1376599080198103045
## 16 1376487077441855491
## 17 1378564346498928642
## 18 1376294537027584000
## 19 1376352746593456129
## 20 1376563871012507649
##                                                                                                                                                                                                     text
## 1  Simon Parkes Update With great regret I must announce that children were being transported in cargo containers. The operation in the Suez Canal was a success and warrants me to do a video update...
## 2                                                                                                                                                      Traffic jam at the Suez Canal. About 350 vessels.
## 3                                                                                                                                             To be clear The USS Dwight D. Eisenhower Aircraft Carrier.
## 4                                                                                                                                                                                                 Update
## 5                                                                                                                                                SEUZ CANAL BLOCKAGE  HOW SHIP FROM SEUZ CANAL IN DETAIL
## 6                                                                                                                                                SEUZ CANAL BLOCKAGE  HOW SHIP FROM SEUZ CANAL IN DETAIL
## 7                                 Who knows why the Asia Ruby III is still waiting to enter the? She was tugged out after thewas stranded and normally should have been one of the first vessels to pass
## 8                                                                                                                                                                                 Ever Given is moving !
## 9                                                                                                                        Good to see that Suez Canal Authorities give priority to vessels with livestock
## 10                                                                                                                                                  Yes ! The Dutch have arrived !! Salvation is near !!
## 11                                                                                                                    Can I name mine the Suez Canal because she's too shallow for the ? I hope it's not
## 12  Do you miss the fun ofblocking the Suez? The last time it happened was even weirder I really loved reading about the Great Bitter Lake Association via  this piece onis wonderful and full of photos
## 13                                                                                                                 The owner of the shipfiles a lawsuit against its operator for delinquency in theCanal
## 14                                                                                                                    TheAuthority reveals the case of.. and explains the story of the bulldozers photos
## 15                                                                                                              International newspapers after the float one of the largest rescue operations in history
## 16                                                                                                                                             How the shipgot stuck in thethrough satellite A new video
## 17                                                                                                                 The owner of the shipfiles a lawsuit against its operator for delinquency in theCanal
## 18                                                                                                                                       How the shipgot stuck in thethrough satellite A very new video 
## 19                                                                                                                                                  How the shipgot stuck in thethroughA very new video 
## 20                                                                                                                                                        The joy of the crew of theship as it exits the
##    sentiment sentimentValue
## 1   Positive              3
## 2    Neutral              2
## 3    Neutral              2
## 4    Neutral              2
## 5    Neutral              2
## 6    Neutral              2
## 7   Negative              1
## 8   Positive              3
## 9    Neutral              2
## 10  Positive              3
## 11  Negative              1
## 12  Positive              3
## 13   Neutral              2
## 14   Neutral              2
## 15   Neutral              2
## 16   Neutral              2
## 17   Neutral              2
## 18   Neutral              2
## 19  Positive              3
## 20   Neutral              2
Sentiment categories in the dataset
## [1] "Positive"     "Neutral"      "Negative"     "Verypositive" "Verynegative"
## .
##     Negative      Neutral     Positive Verynegative Verypositive 
##         1180         2875          869            2           12
Having seen the sentiment distribution, which words are used in the positive tweets and in the negative tweets?
## Joining, by = "word"
[Figures: two word clouds of the words used in tweets, split by sentiment]
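The word counts behind such word clouds could be produced roughly as follows: join tokens to their tweet's sentiment, drop stop words, and count words per sentiment. This is a base R sketch on toy data; the tiny stop list and all values are illustrative, and the plotting call assumes the wordcloud package is installed.

```r
# Toy inputs (illustrative values).
stop_words <- c("the", "be", "a", "of", "in", "to")
tokens <- data.frame(
  status_id   = c("t1", "t1", "t1", "t2", "t2", "t2"),
  lower_lemma = c("the", "ship", "move", "ship", "be", "stick"),
  stringsAsFactors = FALSE
)
sentiment <- data.frame(
  status_id = c("t1", "t2"),
  sentiment = c("Positive", "Negative"),
  stringsAsFactors = FALSE
)

# Attach each tweet's sentiment to its tokens, then remove stop words.
merged <- merge(tokens, sentiment, by = "status_id")
merged <- merged[!merged$lower_lemma %in% stop_words, ]

# Count words within each sentiment and keep only combinations that occur.
word_counts <- as.data.frame(
  table(word = merged$lower_lemma, sentiment = merged$sentiment),
  stringsAsFactors = FALSE
)
word_counts <- word_counts[word_counts$Freq > 0, ]

positive_words <- word_counts[word_counts$sentiment == "Positive", ]
negative_words <- word_counts[word_counts$sentiment == "Negative", ]

# Optional plot, only if the wordcloud package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(positive_words$word, positive_words$Freq, min.freq = 1)
}
```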

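2. Sentimentr: English sentiment analysis

2.1 Introduction to sentimentr

Sentiment score of each text

Scores are centered on 0: values below 0 are negative, values above 0 are positive, and 0 is neutral. Most scores fall between -1 and 1, although strongly worded sentences can fall outside that range.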
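A minimal sketch of text-level scoring with sentimentr, assuming the package is installed. The three sample texts are short illustrations and are not claimed to be the inputs behind the output shown here.

```r
library(sentimentr)

# Illustrative sample texts.
mytext <- c(
  "do you like it?  But I hate really bad dogs",
  "I am the best friend.",
  "Do you really like it?  I'm not a fan"
)

# Split into sentences first, then average sentence scores per text.
by_text <- sentiment_by(get_sentences(mytext))
print(by_text)
```

`sentiment_by()` returns one row per text with its word count, the standard deviation of the sentence scores, and the average sentiment; `sd` is NA when a text has a single sentence.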

##    element_id word_count       sd ave_sentiment
## 1:          1         10 1.497465    -0.8088680
## 2:          2          5       NA     0.5813777
## 3:          3          9 0.284605     0.2196345
Sentiment score of each sentence
##    element_id sentence_id word_count  sentiment
## 1:          1           1          4  0.2500000
## 2:          1           2          6 -1.8677359
## 3:          2           1          5  0.5813777
## 4:          3           1          5  0.4024922
## 5:          3           2          4  0.0000000
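  • Returns a data frame with 4 columns:
    • element_id: index of the text
    • sentence_id: index of the sentence within that text
    • word_count: number of words in the sentence
    • sentiment: sentiment score of the sentence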
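Sentence-level scores can be obtained with `sentiment()`, sketched here on illustrative texts and assuming sentimentr is installed. The `polarity` vector is an added convenience for this example, not part of the package output.

```r
library(sentimentr)

# Illustrative sample texts.
mytext <- c(
  "do you like it?  But I hate really bad dogs",
  "I am the best friend.",
  "Do you really like it?  I'm not a fan"
)

# One row per sentence: element_id, sentence_id, word_count, sentiment.
by_sentence <- sentiment(get_sentences(mytext))

# Classify each sentence by the sign of its score.
polarity <- ifelse(by_sentence$sentiment > 0, "positive",
            ifelse(by_sentence$sentiment < 0, "negative", "neutral"))

print(by_sentence)
print(polarity)
```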

2.2 Applying sentimentr to the Twitter data

Counting the positive words in the tweets
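One way to get word-level counts like these is `extract_sentiment_terms()`, whose result carries a `counts` attribute with the columns `words`, `polarity`, and `n`. The sketch below assumes sentimentr is installed and uses two made-up tweets; splitting by the sign of `polarity` gives the positive and negative word tables.

```r
library(sentimentr)

# Made-up example tweets.
tweets <- c(
  "The ship is stuck and the canal is blocked.",
  "The crew successfully refloated the ship."
)

# Extract polarized words, then read the per-word counts attribute.
terms  <- extract_sentiment_terms(get_sentences(tweets))
counts <- attributes(terms)$counts

positive_counts <- counts[counts$polarity > 0, ]  # words with positive polarity
negative_counts <- counts[counts$polarity < 0, ]  # words with negative polarity

print(positive_counts)
print(negative_counts)
```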
##             words polarity  n
##   1:       please      1.0 47
##   2: successfully      1.0 30
##   3:    brilliant      1.0  3
##   4:      quickly      1.0  3
##   5:       master      1.0  3
##  ---                         
## 335: considerable      0.1  1
## 336:       sermon      0.1  1
## 337:      prepare      0.1  1
## 338:      reading      0.1  1
## 339:       expand      0.1  1
Counting the negative words in the tweets
## Selecting by n
##       words polarity   n
## 1     stuck    -0.25 114
## 2  blocking    -0.50  65
## 3  stranded    -1.00  34
## 4   blocked    -0.40  25
## 5    crisis    -0.75  24
## 6    bitter    -0.50  21
## 7  blockage    -0.60  18
## 8       jam    -1.00  17
## 9   aground    -0.50  15
## 10 grounded    -0.25  10
## 11  problem    -0.75  10
Highlight each sentence according to whether it is positive or negative
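A sketch of the highlighting step, assuming sentimentr is installed. `highlight()` writes an HTML file in which sentences are colored by polarity; the toy data frame and the choice to group by `status_id` are assumptions, and `open = FALSE` keeps the browser from launching.

```r
library(sentimentr)

# Toy tweets (illustrative values).
tweets <- data.frame(
  status_id = c("t1", "t2"),
  text = c("The ship is stuck. The canal is blocked.",
           "The crew successfully refloated the ship. Great news!"),
  stringsAsFactors = FALSE
)

# Score sentences grouped by tweet, then write the highlighted HTML report.
by_tweet <- with(tweets, sentiment_by(get_sentences(text), status_id))
out_file <- file.path(tempdir(), "polarity.html")
highlight(by_tweet, file = out_file, open = FALSE)
```

With the default `open = TRUE`, the file is opened in the browser after being saved, which produces the "Saved in" and "Opening" messages.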
## Saved in /tmp/RtmpENlj5O/polarity.html
## Opening /tmp/RtmpENlj5O/polarity.html ...

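Summary

coreNLP

  1. Identify the key people, organizations, and countries in the topic
  2. Use syntactic analysis to find the dependency relations within sentences
  3. Find the words most commonly used in positive and in negative tweets

sentimentr

  1. Find the positive and negative words in the tweets, and identify which sentences in each text are positive or negative
  2. Track how sentiment changes by date and how it differs across posting clients

Practice

Working as a study group, pick a topic you are interested in and analyze data on it. Any data source is acceptable, such as tweets or text from the text-analysis platform. Practice analyzing it with coreNLP or sentimentr, publish the finished assignment on RPubs, and upload the link to "Week 7 HW" on the university's online course site (網大). One upload per group is enough.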