Social Media Analysis: CoreNLP and sentimentr Practice (with Twitter API Application) 何明信
2021/4/10
Introduction
Goal: use coreNLP and sentimentr to analyze Twitter text about Taiwan and TSMC.
Overview: since the start of the US-China trade war, Taiwan's TSMC has held a pivotal position in the global chip industry. We analyze Twitter posts to understand global users' perceptions of, and sentiment toward, "Taiwan and TSMC".
Data source: Twitter, 3/27~4/4, 5,000 tweets, English.
1. coreNLP
Install Java JRE 1.8+ (https://www.java.com/zh_TW/), download the Stanford CoreNLP full package, then install the R packages:
packages = c("dplyr","ggplot2","rtweet","xml2","httr","jsonlite","data.tree","NLP","igraph","sentimentr","tidytext","wordcloud2","DiagrammeR","magrittr","stringr","scales")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
library(wordcloud2)
library(ggplot2)
library(scales)
library(rtweet)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(xml2)
library(httr)
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:rtweet':
##
## flatten
library(magrittr)
library(data.tree)
library(tidytext)
library(stringr)
library(DiagrammeR)
load("~/coreNLP_all/.RData")
1.1 Data Collection: Tweets
(1). Twitter API setup: fetch tweets through rtweet
app = 'heminghsin'
# API credentials redacted; substitute your own keys and tokens
consumer_key = '<YOUR_CONSUMER_KEY>'
consumer_secret = '<YOUR_CONSUMER_SECRET>'
access_token = '<YOUR_ACCESS_TOKEN>'
access_secret = '<YOUR_ACCESS_SECRET>'
twitter_token <- create_token(app,consumer_key, consumer_secret,
access_token, access_secret,set_renv = FALSE)
(2). Set the search keywords and fetch tweets
# search keywords
key = c("#Taiwan")
context = "TSMC"
q = paste(c(key,context),collapse=" OR ")
# the query string is "#Taiwan OR TSMC"
# OR is used so the result set is not too small, as it would be if #Taiwan and TSMC had to appear together
# fetch 5,000 tweets, excluding retweets
tweets = search_tweets(q,lang="en",n=5000,include_rts = FALSE,token = twitter_token)
(3). Cleaning the tweet text
## text-cleaning helper
clean = function(txt) {
txt = iconv(txt, "latin1", "ASCII", sub="") # re-encode to ASCII, dropping non-ASCII characters such as emoji
txt = gsub("(@|#)\\w+", "", txt) # remove mentions and hashtags (@ or # followed by letters, digits, underscores)
txt = gsub("(http|https)://.*", "", txt) # remove URLs (. = any character, * = zero or more times)
txt = gsub("[ \t]{2,}", "", txt) # remove runs of two or more spaces or tabs
txt = gsub("\\n"," ",txt) # replace line breaks with spaces
txt = gsub("\\s+"," ",txt) # collapse one or more whitespace characters (+ = one or more) into one space
txt = gsub("^\\s+|\\s+$","",txt) # trim leading and trailing whitespace
txt = gsub("&.*;","",txt) # remove HTML entity codes
txt = gsub("[^a-zA-Z0-9?!. ']","",txt) # keep only letters, digits, spaces, ?, !, ., and ' (drops emoji and other symbols)
txt }
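A quick sanity check on a made-up string (hypothetical example) shows what the pipeline keeps and drops: mentions, hashtags, and URLs are removed, and extra whitespace is collapsed.
clean("Check this out! @user #Taiwan https://t.co/abc")
## [1] "Check this out!"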
tweets$text = clean(tweets$text) # apply the cleaning to the text column
df = data.frame()
df = rbind(df,tweets) # bind the fetched tweets into a data frame
df = df[!duplicated(df[,"status_id"]),] # drop duplicate tweets
head(df)
## # A tibble: 6 x 90
## user_id status_id created_at screen_name text source
## <chr> <chr> <dttm> <chr> <chr> <chr>
## 1 121796… 13811071… 2021-04-11 04:49:27 thittracer… Than… Twitt…
## 2 121796… 13785815… 2021-04-04 05:33:52 thittracer… All … Twitt…
## 3 121796… 13805571… 2021-04-09 16:24:10 thittracer… We p… Twitt…
## 4 121796… 13801401… 2021-04-08 12:46:59 thittracer… Than… Twitt…
## 5 135753… 13811060… 2021-04-11 04:45:22 kyikyizaw2… Than… Twitt…
## 6 134897… 13811060… 2021-04-11 04:45:07 lynx_ivie Than… Twitt…
## # … with 84 more variables: display_text_width <dbl>, reply_to_status_id <chr>,
## # reply_to_user_id <chr>, reply_to_screen_name <chr>, is_quote <lgl>,
## # is_retweet <lgl>, favorite_count <int>, retweet_count <int>,
## # quote_count <int>, reply_count <int>, hashtags <list>, symbols <list>,
## # urls_url <list>, urls_t.co <list>, urls_expanded_url <list>,
## # media_url <list>, media_t.co <list>, media_expanded_url <list>,
## # media_type <list>, ext_media_url <list>, ext_media_t.co <list>,
## # ext_media_expanded_url <list>, ext_media_type <chr>,
## # mentions_user_id <list>, mentions_screen_name <list>, lang <chr>,
## # quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
## # quoted_source <chr>, quoted_favorite_count <int>,
## # quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>,
## # quoted_name <chr>, quoted_followers_count <int>,
## # quoted_friends_count <int>, quoted_statuses_count <int>,
## # quoted_location <chr>, quoted_description <chr>, quoted_verified <lgl>,
## # retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>,
## # retweet_source <chr>, retweet_favorite_count <int>,
## # retweet_retweet_count <int>, retweet_user_id <chr>,
## # retweet_screen_name <chr>, retweet_name <chr>,
## # retweet_followers_count <int>, retweet_friends_count <int>,
## # retweet_statuses_count <int>, retweet_location <chr>,
## # retweet_description <chr>, retweet_verified <lgl>, place_url <chr>,
## # place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>,
## # country_code <chr>, geo_coords <list>, coords_coords <list>,
## # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>,
## # description <chr>, url <chr>, protected <lgl>, followers_count <int>,
## # friends_count <int>, listed_count <int>, statuses_count <int>,
## # favourites_count <int>, account_created_at <dttm>, verified <lgl>,
## # profile_url <chr>, profile_expanded_url <chr>, account_lang <lgl>,
## # profile_banner_url <chr>, profile_background_url <chr>,
## # profile_image_url <chr>
df has 90 columns, but we will only use a few of them:
user_id: user id; status_id: tweet id; created_at: time posted; text: tweet content; source: posting client
Next, check the number of rows and the time distribution of the data.
created_at is already a date-type column, so min/max give the earliest and latest dates directly. Note: rtweet can only fetch tweets from roughly the last 10 days.
nrow(df)
## [1] 5000
min(df$created_at)
## [1] "2021-04-03 12:59:11 UTC"
max(df$created_at)
## [1] "2021-04-11 04:49:27 UTC"
1.2 Connecting to the CoreNLP API
(1). API call setup
Server side: first start the CoreNLP server in a terminal. From the CoreNLP directory, run: java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
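Before wiring up the R helpers, it can help to confirm the server is reachable. A minimal check with httr (assuming the server was started locally on port 9000):
resp <- httr::GET("http://localhost:9000/")
httr::status_code(resp) # 200 means the CoreNLP server is up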
# build the coreNLP api url: combine host/port with the annotator properties
generate_API_url <- function(host, port="9000",
tokenize.whitespace="false", annotators=""){ # tokenization is not purely whitespace-based
url <- sprintf('http://%s:%s/?properties={"tokenize.whitespace":"%s","annotators":"%s"}', host, port, tokenize.whitespace, annotators)
URLencode(url) # return the encoded url (the original assigned it, which returns invisibly)
}
# location of the coreNLP service
host = "140.117.79.167"
generate_API_url(host)
# call the coreNLP api
call_coreNLP <- function(server_host, text, host="localhost", language="eng",
tokenize.whitespace="true", ssplit.eolonly="true", annotators=c("tokenize","ssplit","pos","lemma","ner","parse","sentiment")){
# assume two core-nlp servers: one for English (port 9000) and one for Chinese (port 9001)
port <- ifelse(language=="eng", 9000, 9001);
# build the api url
url <- generate_API_url(server_host, port=port,
tokenize.whitespace=tokenize.whitespace, annotators=paste0(annotators, collapse = ','))
result <- POST(url, body = text, encode = "json")
doc <- httr::content(result, "parsed","application/json",encoding = "UTF-8")
return (doc)
}
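A hypothetical single-document call, assuming the server at `host` is running:
doc <- call_coreNLP(host, "TSMC is expanding its fabs in Taiwan.")
names(doc) # the parsed json typically contains a "sentences" element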
# run a set of documents through the coreNLP service
coreNLP <- function(data,host){
# send each document to core-nlp in turn; each returns a json result
# keep each parsed result together with its original row
result <- apply(data, 1 , function(x){
object <- call_coreNLP(host, x['text'])
list(doc=object, data=x)
})
return(result)
}
(2). Parsing functions
Extract the tokenization results from the returned objects, output in tidy-data format:
coreNLP_tokens_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# take the first sentence
sentences <- doc$sentences
sen <- sentences[[1]]
tokens <- do.call(rbind, lapply(sen$tokens, function(x){
result <- data.frame(word=x$word, lemma=x$lemma, pos=x$pos, ner=x$ner)
result
}))
tokens <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(tokens))) %>%
bind_cols(tokens)
tokens
}))
return(result)
}
Extract the word-dependency relations from the returned core-nlp objects, output in tidy-data format:
coreNLP_dependency_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# take the first sentence
sentences <- doc$sentences
sen <- sentences[[1]]
dependencies <- do.call(rbind, lapply(sen$basicDependencies, function(x){
result <- data.frame(dep=x$dep, governor=x$governor, governorGloss=x$governorGloss, dependent=x$dependent, dependentGloss=x$dependentGloss)
result
}))
dependencies <- original_data %>%
t() %>%
data.frame() %>%
select(-text) %>%
slice(rep(1:n(), each = nrow(dependencies))) %>%
bind_cols(dependencies)
dependencies
}))
return(result)
}
Extract the sentence sentiment from the returned core-nlp objects, output in tidy-data format:
coreNLP_sentiment_parser <- function(coreNLP_objects){
result <- do.call(rbind, lapply(coreNLP_objects, function(obj){
original_data <- obj$data
doc <- obj$doc
# take the first sentence
sentences <- doc$sentences
sen <- sentences[[1]]
sentiment <- original_data %>%
t() %>%
data.frame() %>%
bind_cols(data.frame(sentiment=sen$sentiment, sentimentValue=sen$sentimentValue))
sentiment
}))
return(result)
}
Graphing the dependency tree
# display the coreNLP parse as a tree graph
parse2tree <- function(ptext) {
stopifnot(require(NLP) && require(igraph))
# this step modifies coreNLP parse tree to mimic openNLP parse tree
ptext <- gsub("[\r\n]", "", ptext)
ptext <- gsub("ROOT", "TOP", ptext)
## Replace words with unique versions
ms <- gregexpr("[^() ]+", ptext) # just ignoring spaces and brackets?
words <- regmatches(ptext, ms)[[1]] # just words
regmatches(ptext, ms) <- list(paste0(words, seq.int(length(words)))) # add id to words
## Going to construct an edgelist and pass that to igraph
## allocate here since we know the size (number of nodes - 1) and -1 more to exclude 'TOP'
edgelist <- matrix('', nrow=length(words)-2, ncol=2)
## Function to fill in edgelist in place
edgemaker <- (function() {
i <- 0 # row counter
g <- function(node) { # the recursive function
if (inherits(node, "Tree")) { # only recurse subtrees
if ((val <- node$value) != 'TOP1') { # skip 'TOP' node (added '1' above)
for (child in node$children) {
childval <- if(inherits(child, "Tree")) child$value else child
i <<- i+1
edgelist[i,1:2] <<- c(val, childval)
}
}
invisible(lapply(node$children, g))
}
}
})()
## Create the edgelist from the parse tree
edgemaker(Tree_parse(ptext))
tree <- FromDataFrameNetwork(as.data.frame(edgelist))
return (tree)
}
Sending the sentences to the service
Get the objects returned by coreNLP. Don't run this chunk lightly: it can take about half an hour (and may crash a machine with only 4 GB of RAM).
gc() # release unused memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5726450 305.9 10891302 581.7 10891302 581.7
## Vcells 16101010 122.9 27846898 212.5 27846632 212.5
t0 = Sys.time()
obj = df[,c(2,5)] %>% filter(text != "") %>% coreNLP(host) # send to the coreNLP server
# the object passed to coreNLP must be a data.frame with a text column
Sys.time() - t0 # elapsed time
## Time difference of 4.801094 mins
#Time difference of 4.8 mins
save.image(".RData")
1.3 Extracting Results
(1). Tokenization, lemmatization, POS tagging, NER
tokens = coreNLP_tokens_parser(obj)
head(tokens,20)
## status_id word lemma pos ner
## 1 1381107102932992001 Thank thank NN O
## 2 1381107102932992001 you you PN O
## 3 1378581564951388162 All all DT O
## 4 1378581564951388162 of of P O
## 5 1378581564951388162 our our NN O
## 6 1378581564951388162 prayers prayers NN O
## 7 1378581564951388162 are be VC O
## 8 1378581564951388162 with with P O
## 9 1378581564951388162 youDeepest youdeepest URL O
## 10 1378581564951388162 condolences condolence VV O
## 11 1378581564951388162 to to P O
## 12 1378581564951388162 those those JJ O
## 13 1378581564951388162 families families NN O
## 14 1378581564951388162 who who NN O
## 15 1378581564951388162 lost lost NN O
## 16 1378581564951388162 their their NN O
## 17 1378581564951388162 family family JJ O
## 18 1378581564951388162 members members NN O
## 19 1378581564951388162 . . PU O
## 20 1380557157616214021 We we PN O
Columns returned by coreNLP_tokens_parser: status_id: the unique id of a tweet, matching status_id in df; word: the original token; lemma: the lemmatized form of the token; pos: the part-of-speech tag; ner: the named-entity tag.
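For example, the pos column lets us count the most frequent nouns. A minimal sketch ("NN"/"NNS" are the usual noun tags; adjust to the tags actually present in the output above):
tokens %>%
  filter(pos %in% c("NN", "NNS")) %>%
  count(lemma, sort = TRUE) %>%
  head(10)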
(2). Named-entity recognition (NER)
Check which entity types were recognized:
unique(tokens$ner)
## [1] O NUMBER PERSON DATE TIME
## [6] MISC ORDINAL ORGANIZATION TITLE
## Levels: O NUMBER PERSON DATE TIME MISC ORDINAL ORGANIZATION TITLE
# excluding the 'O' (other) tag, how many distinct words were tagged as named entities
length(unique(tokens$word[tokens$ner != "O"]))
## [1] 1805
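A quick sketch of how the tagged tokens spread across entity types:
tokens %>%
  filter(ner != "O") %>%
  count(ner, sort = TRUE) # tokens per entity type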
(3). Lowercasing
Case affects CoreNLP's NER decisions, so the tweets were fed in with their original casing. After the annotators have run, we create two new columns, lower_word and lower_lemma, holding the lowercased word and lemma so that word frequencies are counted correctly: lowercasing maps different casings of the same word (e.g., Evergiven and evergiven) to a single form before counting.
tokens$lower_word = tolower(tokens$word)
tokens$lower_lemma = tolower(tokens$lemma)
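A quick check that the case-folding behaves as intended (sketch):
tokens %>%
  count(lower_word, sort = TRUE) %>%
  head(10) # top words after case-folding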
1.4 Exploratory Analysis: NER
Gauging discussion intensity from dates (DATE)
CoreNLP's NER lets us extract the dates (DATE) mentioned in tweets discussing Taiwan's TSMC, as a first look at which dates are key.
tokens %>%
filter(ner == "DATE") %>% # keep tokens tagged DATE
group_by(lower_word) %>% # group by word
summarize(count = n()) %>% # count each group
top_n(n = 7, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is DATE)") +
theme(text=element_text(size=14))+
coord_flip()
TSMC was founded in 1987. In 2021, thanks to its world-leading technology, it came to be called Taiwan's "sacred mountain protecting the nation". From 2023 it is slated to start manufacturing CPUs for Intel.
Organizations involved (ORGANIZATION)
Using CoreNLP's NER we can extract the organizations (ORGANIZATION) mentioned in tweets about Taiwan's TSMC, as a first look at the main companies and bodies in this topic.
tokens %>%
filter(ner == "ORGANIZATION") %>% #篩選NER為ORGANIZATION
group_by(lower_word) %>% #根據word分組
summarize(count = n()) %>% #計算每組
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is ORGANIZATION)") +
theme(text=element_text(size=14))+
coord_flip()
Organizations such as Intel, ARM, Samsung, and AMD are all related to TSMC, and organizations containing the word "Technology" are closely tied to TSMC as well.
People involved (PERSON)
Using CoreNLP's NER we can extract the people (PERSON) mentioned in tweets about Taiwan's TSMC, as a first look at the main figures in this topic.
tokens %>%
filter(ner == "PERSON") %>% #篩選NER為PERSON
group_by(lower_word) %>% #根據word分組
summarize(count = n()) %>% #計算每組
top_n(n = 10, count) %>%
ungroup() %>%
mutate(word = reorder(lower_word, count)) %>%
ggplot(aes(word, count)) +
geom_col()+
ggtitle("Word Frequency (NER is PERSON)") +
theme(text=element_text(size=14))+
coord_flip()
"john" is presumably John Neuffer, CEO of the US Semiconductor Industry Association (SIA).
1.5 Exploratory Analysis: Dependencies
Sentence dependency-relation results
dependencies = coreNLP_dependency_parser(obj)
head(dependencies,20)
## status_id dep governor governorGloss dependent
## 1 1381107102932992001 ROOT 0 ROOT 2
## 2 1381107102932992001 compound:nn 2 you 1
## 3 1378581564951388162 ROOT 0 ROOT 8
## 4 1378581564951388162 nsubj 8 condolences 1
## 5 1378581564951388162 case 4 prayers 2
## 6 1378581564951388162 compound:nn 4 prayers 3
## 7 1378581564951388162 nmod:prep 8 condolences 4
## 8 1378581564951388162 cop 8 condolences 5
## 9 1378581564951388162 case 7 youDeepest 6
## 10 1378581564951388162 nmod:prep 8 condolences 7
## 11 1378581564951388162 case 16 members 9
## 12 1378581564951388162 amod 12 who 10
## 13 1378581564951388162 compound:nn 12 who 11
## 14 1378581564951388162 compound:nn 16 members 12
## 15 1378581564951388162 compound:nn 14 their 13
## 16 1378581564951388162 compound:nn 16 members 14
## 17 1378581564951388162 amod 16 members 15
## 18 1378581564951388162 nmod:prep 8 condolences 16
## 19 1378581564951388162 punct 8 condolences 17
## 20 1380557157616214021 ROOT 0 ROOT 8
## dependentGloss
## 1 you
## 2 Thank
## 3 condolences
## 4 All
## 5 of
## 6 our
## 7 prayers
## 8 are
## 9 with
## 10 youDeepest
## 11 to
## 12 those
## 13 families
## 14 who
## 15 lost
## 16 their
## 17 family
## 18 members
## 19 .
## 20 condolences.
Visualizing the dependency tree
parse_tree <- obj[[10]]$doc[[1]][[1]]$parse
tree <- parse2tree(parse_tree)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:httr':
##
## content
## The following object is masked from 'package:ggplot2':
##
## annotate
## Loading required package: igraph
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
SetNodeStyle(tree, style = "filled,rounded", shape = "box")
plot(tree)
1.6 Exploratory Analysis: Sentiment
Sentence sentiment values
Sentiment scores range from 0 (lowest) to 4 (highest): 0, 1 = very negative, negative; 2 = neutral; 3, 4 = positive, very positive.
sentiment = coreNLP_sentiment_parser(obj)
head(sentiment,20)
## status_id
## 1 1381107102932992001
## 2 1378581564951388162
## 3 1380557157616214021
## 4 1380140111623565317
## 5 1381106072539164672
## 6 1381106011503689729
## 7 1379374237429850113
## 8 1381105387277934593
## 9 1379668020063182850
## 10 1379588774976311301
## 11 1379583078377480192
## 12 1379583620021579776
## 13 1379304651980337152
## 14 1380406663283281920
## 15 1378844014137417729
## 16 1378474621482135552
## 17 1380338712190410753
## 18 1378617738277527554
## 19 1380080370473820163
## 20 1379714870073913344
## text
## 1 Thank you
## 2 All of our prayers are with youDeepest condolences to those families who lost their family members .
## 3 We people of myanmar are with youDeepest condolences.
## 4 Thank you too
## 5 Thank youfor supporting us. We are all in this together.
## 6 Thank Youfor Supporting us !!
## 7 Taiwan's Big Issue. Can't read most of it but still support the local seller when I see her.
## 8 Leveraging Somalilands Blue Economy
## 9 Migrant Workers In Taiwan Shine Light On Racism In Society
## 10 Somaliland Progressing On The Diplomacy Front
## 11 NT10 Billion Investment In Taiwanese Super Battery Factory check out the pic
## 12 MediaTek Samsung Introduce Worlds First WiFi TV
## 13 Regent Taipei Launches Magic Mirror Fitness Journey Room Package
## 14 National Taipei University Of Business Intake Tops Taiwan Rankings Of Public Vocational Universities
## 15 TTT InterviewLobsang SangayPresident Of Tibets Government In Exile
## 16 Uyghurs In Exile To Honor Victims ofHistoric Massacre
## 17 No Nukes Protest Bracing Up For August Referendum On Nuclear Power
## 18 China Boycotts Foreign Brands Over Xinjiang Cotton
## 19 Three Red Dot Awards for Outstanding Product Design For DLink
## 20 Before after a ceremony ahead of the launch of this year's Taiwan Premier League. Good to see the effort the CTFA have been putting in at all levels to promoting football in Taiwan of late.
## sentiment sentimentValue
## 1 Neutral 2
## 2 Neutral 2
## 3 Negative 1
## 4 Neutral 2
## 5 Negative 1
## 6 Neutral 2
## 7 Negative 1
## 8 Neutral 2
## 9 Negative 1
## 10 Neutral 2
## 11 Negative 1
## 12 Neutral 2
## 13 Negative 1
## 14 Negative 1
## 15 Negative 1
## 16 Negative 1
## 17 Negative 1
## 18 Negative 1
## 19 Negative 1
## 20 Negative 1
Sentiment categories present in the dataset:
unique(sentiment$sentiment)
## [1] Neutral Negative Positive Verynegative Verypositive
## Levels: Neutral Negative Positive Verynegative Verypositive
sentiment$sentimentValue = sentiment$sentimentValue %>% as.numeric
# distribution of sentiment labels across tweets
sentiment$sentiment %>% table()
## .
## Neutral Negative Positive Verynegative Verypositive
## 1436 3227 280 17 4
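For a quicker read, the same distribution as a bar chart (a minimal ggplot2 sketch):
sentiment %>%
  count(sentiment) %>%
  ggplot(aes(x = reorder(sentiment, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "sentiment", y = "tweets")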
Average sentiment score over time
df$date = as.Date(df$created_at)
sentiment %>%
merge(df[,c("status_id","source","date")]) %>%
group_by(date) %>%
summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>%
ggplot(aes(x=date,y=avg_sentiment)) +
geom_line()
Sentiment over time by client
sentiment %>%
merge(df[,c("status_id","source","date")]) %>%
filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>%
group_by(date,source) %>%
summarise(avg_sentiment = mean(sentimentValue,na.rm=T)) %>%
ggplot(aes(x=date,y=avg_sentiment,color=source)) +
geom_line()
Beyond the distribution itself: what vocabulary do positive and negative tweets use?
# vocabulary used in positive tweets
sentiment %>%
merge(tokens) %>%
anti_join(stop_words) %>%
filter(!lower_word %in% c('i','the')) %>%
filter(sentiment == "Verypositive" | sentiment =='Positive') %>%
group_by(lower_lemma) %>% # group by lemma
summarize(count = n()) %>%
filter(count >5 & count<400)%>%
wordcloud2()
## Joining, by = "word"
## Warning: Column `word` joining factor and character vector, coercing into
## character vector
# vocabulary used in negative tweets
sentiment %>%
merge(tokens) %>%
anti_join(stop_words) %>%
filter(!lower_word %in% c('i','the')) %>%
filter(sentiment == "Verynegative" | sentiment =='Negative') %>%
group_by(lower_lemma) %>%
summarize(count = n()) %>%
filter(count >10 &count<400)%>%
wordcloud2()
## Joining, by = "word"
## Warning: Column `word` joining factor and character vector, coercing into
## character vector
2. sentimentr
2.1 Basic usage
library(sentimentr)
mytext <- c(
'do you like it? But I hate really bad dogs',
'I am the best friend.',
'Do you really like it? I\'m not a fan'
)
mytext <- get_sentences(mytext) # convert the character vector into a list of sentence-split character vectors
Sentiment score per document
Scores typically fall between -1 and 1: < 0 is negative, > 0 is positive, and 0 is neutral.
sentiment_by(mytext) #document level
## element_id word_count sd ave_sentiment
## 1: 1 10 1.497465 -0.8088680
## 2: 2 5 NA 0.5813777
## 3: 3 9 0.284605 0.2196345
Sentiment score per sentence
sentiment(mytext) #sentence level
## element_id sentence_id word_count sentiment
## 1: 1 1 4 0.2500000
## 2: 1 2 6 -1.8677359
## 3: 2 1 5 0.5813777
## 4: 3 1 5 0.4024922
## 5: 3 2 4 0.0000000
Returns a data frame with 4 columns: element_id – which document; sentence_id – which sentence within that document; word_count – number of words in the sentence; sentiment – the sentence's sentiment score.
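For example, to pull out the clearly negative sentences (a sketch in the data.table style used below; the -0.5 cutoff is arbitrary):
sentiment(mytext)[sentiment < -0.5, ]
# here only "But I hate really bad dogs" (element 1, sentence 2) qualifies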
2.2 Applying sentimentr to the Twitter Data
Positive words in the tweets
set.seed(10)
mytext <- get_sentences(tweets$text) # sentence-split the tweet texts
x <- sample(tweets$text, 1000, replace = FALSE) # sample 1,000 tweets without replacement
sentiment_words <- extract_sentiment_terms(x) # extract the sentiment-bearing terms
sentiment_counts <- attributes(sentiment_words)$counts # count how often each term appears
sentiment_counts[polarity > 0,] # positive words
## words polarity n
## 1: please 1.0 12
## 2: top 1.0 11
## 3: quickly 1.0 5
## 4: wonder 1.0 4
## 5: quality 1.0 3
## ---
## 529: prays 0.1 1
## 530: prefer 0.1 1
## 531: aiding 0.1 1
## 532: pastry 0.1 1
## 533: thriller 0.1 1
Negative words in the tweets
sentiment_counts[polarity < 0,] %>% arrange(desc(n)) %>% top_n(10) # most frequent negative words
## Selecting by n
## words polarity n
## 1 shortage -0.75 38
## 2 war -0.50 25
## 3 drought -0.50 23
## 4 accident -0.50 23
## 5 crash -0.75 23
## 6 problems -0.50 22
## 7 demand -0.50 21
## 8 fight -0.50 21
## 9 hell -0.80 21
## 10 abducted -1.00 21
Highlight each sentence as positive or negative
set.seed(12)
df %>%
filter(status_id %in% sample(unique(status_id), 30)) %>% # 30 random tweets
mutate(review = get_sentences(text)) %$%
sentiment_by(review, status_id) %>%
highlight()
## Saved in /tmp/RtmpmrbFTn/polarity.html
## Opening /tmp/RtmpmrbFTn/polarity.html ...
2.3 Sentiment Fluctuation by Date
Code adapted from https://github.com/trinker/sentimentr
tweets$date = format(tweets$created_at,'%Y%m%d')
(out = tweets %>% with(
sentiment_by( #document level
get_sentences(text),
list( date)
)
))
## date word_count sd ave_sentiment
## 1: 20210403 6214 0.3178200 0.024006530
## 2: 20210404 8914 0.3120603 0.037377585
## 3: 20210405 9939 0.2880040 0.041593417
## 4: 20210406 10981 0.2837754 0.029552804
## 5: 20210407 16217 0.2565380 0.015454642
## 6: 20210408 19673 0.2768096 0.004637294
## 7: 20210409 15455 0.2486074 0.038094908
## 8: 20210410 9671 0.2697096 0.048570705
## 9: 20210411 1964 0.3085321 0.205563232
plot(out)
2.4 Sentiment Fluctuation by Date, Split by Client
(out = tweets %>% filter(source %in% c("Twitter Web Client","Twitter for iPhone","Twitter for Android")) %>% with(
sentiment_by(
get_sentences(text),
list(source, date)
)
))
## source date word_count sd ave_sentiment
## 1: Twitter for Android 20210403 1791 0.2963030 0.041383656
## 2: Twitter for Android 20210404 1884 0.3577777 0.044269272
## 3: Twitter for Android 20210405 1269 0.3084042 0.068604025
## 4: Twitter for Android 20210406 2074 0.2754159 0.056864174
## 5: Twitter for Android 20210407 2599 0.2675583 0.036156592
## 6: Twitter for Android 20210408 4902 0.2707136 -0.007166411
## 7: Twitter for Android 20210409 3063 0.2486039 -0.009105345
## 8: Twitter for Android 20210410 1726 0.2970477 0.056471697
## 9: Twitter for Android 20210411 508 0.2805190 0.398892321
## 10: Twitter for iPhone 20210403 1798 0.3073076 0.062881559
## 11: Twitter for iPhone 20210404 2023 0.2986021 0.106781977
## 12: Twitter for iPhone 20210405 1801 0.2692281 0.044804114
## 13: Twitter for iPhone 20210406 1798 0.2784472 0.041002162
## 14: Twitter for iPhone 20210407 3341 0.2707561 0.003204752
## 15: Twitter for iPhone 20210408 3680 0.2965689 -0.022607100
## 16: Twitter for iPhone 20210409 2531 0.2662340 0.041338444
## 17: Twitter for iPhone 20210410 2114 0.2624354 0.075565175
## 18: Twitter for iPhone 20210411 618 0.3355439 0.191563866
plot(out)
Converting emoji codes into descriptive text
replace_emoji("\U0001f4aa")
## [1] " flexed biceps "
Summary
coreNLP: identified the key people, organizations, and dates of the topic; used syntactic analysis to find the dependency relations within sentences; and found the words most used in positive and negative tweets respectively. sentimentr: found the positive and negative terms in the tweets, identified which sentences in each document were positive or negative, and tracked sentiment fluctuation by date and across clients. Through the guidance of the instructor and TAs, I practiced reading data via a Twitter API application, and learned how to write the code and how important the system environment setup is.