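tidytext applies tidy data principles to make many text mining tasks easier and more effective, and to keep them consistent with tools already in wide use. The package provides functions and supporting datasets for converting text to and from tidy formats, and for switching seamlessly between tidy tools and existing text mining packages.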

What is tidy data?

1. Each variable is a column.
2. Each observation is a row.
3. Each type of observational unit is a table.

Tidy data is usually represented as a data frame.
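To make the three rules concrete, here is a minimal sketch in base R that turns an untidy word-count table into a tidy one. The document names, words, and counts are made up for illustration.

```r
# Untidy: one row per document, one column per word (hypothetical counts)
wide <- data.frame(
  doc     = c("a", "b"),
  game    = c(2, 1),
  weather = c(0, 3),
  stringsAsFactors = FALSE
)

# Tidy: one row per (document, word) observation, one column per variable
tidy <- data.frame(
  doc  = rep(wide$doc, times = 2),
  word = rep(c("game", "weather"), each = nrow(wide)),
  n    = c(wide$game, wide$weather),
  stringsAsFactors = FALSE
)

tidy
```

In the tidy version, adding a new word adds rows rather than columns, which is what lets tools such as dplyr group and count by word.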

unnest_tokens

We first create a character vector holding a few short Chinese sentences:

text <- c("今天天气不错",
          "呆在家里打游戏吧",
          "打什么用游戏",
          "塞尔达传说")

text

## [1] "今天天气不错"     "呆在家里打游戏吧" "打什么用游戏"     "塞尔达传说"

# Convert the data to a data frame

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

text_df <- tibble(line = 1:4, text = text)

text_df

## # A tibble: 4 × 2
##    line text            
##   <int> <chr>           
## 1     1 今天天气不错    
## 2     2 呆在家里打游戏吧
## 3     3 打什么用游戏    
## 4     4 塞尔达传说

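In this data frame, each row holds a string made up of several words. We need to process it further, in a step called tokenization.

A token is a meaningful unit of text, usually a word, that we want to use for further analysis. unnest_tokens() splits the text column so that each row holds one token: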

library(tidytext)

text_df_tokens <- text_df %>%
  unnest_tokens(word, text)

text_df_tokens

## # A tibble: 16 × 2
##     line word 
##    <int> <chr>
##  1     1 今天 
##  2     1 天气 
##  3     1 不错 
##  4     2 呆在 
##  5     2 家里 
##  6     2 打   
##  7     2 游戏 
##  8     2 吧   
##  9     3 打   
## 10     3 什么 
## 11     3 用   
## 12     3 游戏 
## 13     4 塞   
## 14     4 尔   
## 15     4 达   
## 16     4 传说

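To remove stop words, use anti_join() against a stop word list. The stop_words dataset in tidytext covers English only, so it leaves the Chinese tokens above unchanged:

word_df <- anti_join(text_df_tokens, stop_words, by = "word")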
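For Chinese text, a custom list can take the place of stop_words. The following is a minimal sketch, assuming a small hand-picked list: `custom_stop_words` is a hypothetical name and its contents are chosen for illustration, not taken from any standard lexicon.

```r
library(dplyr)

# Tokens like those produced above, rebuilt here so the block is self-contained
tokens <- tibble(
  line = c(1, 1, 1, 2, 2, 2, 2, 2),
  word = c("今天", "天气", "不错", "呆在", "家里", "打", "游戏", "吧")
)

# Hypothetical stop word list; a one-column tibble named `word` so it joins cleanly
custom_stop_words <- tibble(word = c("吧", "用"))

# Keep only the tokens that do NOT appear in the stop word list
tokens_clean <- tokens %>%
  anti_join(custom_stop_words, by = "word")

tokens_clean
```

The join column must have the same name on both sides, or be mapped explicitly through the `by` argument.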

Computing word frequencies

text_df_tokens %>% count(word, sort = TRUE)

## # A tibble: 14 × 2
##    word      n
##    <chr> <int>
##  1 打        2
##  2 游戏      2
##  3 不错      1
##  4 什么      1
##  5 今天      1
##  6 传说      1
##  7 吧        1
##  8 呆在      1
##  9 塞        1
## 10 天气      1
## 11 家里      1
## 12 尔        1
## 13 用        1
## 14 达        1
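count() returns raw counts. Dividing each count by the total gives a relative frequency, which is easier to compare across texts of different lengths. This is a minimal sketch on a small hand-built subset of counts; the column name `proportion` is an assumption made for this example.

```r
library(dplyr)

# A hand-built subset of word counts, for illustration only
word_counts <- tibble(
  word = c("打", "游戏", "今天", "天气"),
  n    = c(2L, 2L, 1L, 1L)
)

# Relative frequency: each word's share of all counted tokens
word_freq <- word_counts %>%
  mutate(proportion = n / sum(n))

word_freq
```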

Sentiment analysis

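There are many methods and dictionaries for evaluating the opinion or emotion in text. The tidytext package provides access to several sentiment lexicons. Three general-purpose lexicons are: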

AFINN from Finn Årup Nielsen,
bing from Bing Liu and collaborators, and
nrc from Saif Mohammad and Peter Turney.

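All three lexicons are based on unigrams, i.e., single words. They contain many English words, each assigned a score for positive or negative sentiment, and in some cases for emotions such as joy, anger, and sadness. The nrc lexicon categorizes words in a binary fashion ("yes"/"no") into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.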

library(tidytext)
library(textdata)
get_sentiments("afinn")

## # A tibble: 2,477 × 2
##    word       value
##    <chr>      <dbl>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,467 more rows

In effect, a lexicon provides sentiment labels for words. The get_sentiments() function lets us retrieve a specific sentiment lexicon.
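A lexicon becomes useful once it is joined to tokenized text. The following is a minimal sketch, assuming the bing lexicon and a made-up pair of English sentences; `review_df` and its contents are hypothetical. English text is used because the lexicons cover English words.

```r
library(dplyr)
library(tidytext)

# Hypothetical English example text
review_df <- tibble(
  line = 1:2,
  text = c("I am happy with this wonderful game",
           "The ending was sad and boring")
)

review_sentiment <- review_df %>%
  unnest_tokens(word, text) %>%                          # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%    # keep only words found in the lexicon
  count(line, sentiment)                                 # tally positive/negative words per line

review_sentiment
```

Words that are not in the lexicon are dropped by inner_join(), so the result reflects only the words the lexicon knows about.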

tfidf

library(dplyr)
library(janeaustenr)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words

## # A tibble: 40,379 × 3
##    book              word      n
##    <fct>             <chr> <int>
##  1 Mansfield Park    the    6206
##  2 Mansfield Park    to     5475
##  3 Mansfield Park    and    5438
##  4 Emma              to     5239
##  5 Emma              the    5201
##  6 Emma              and    4896
##  7 Mansfield Park    of     4778
##  8 Pride & Prejudice the    4331
##  9 Emma              of     4291
## 10 Pride & Prejudice to     4162
## # … with 40,369 more rows

# find the words most distinctive to each document
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))

## # A tibble: 40,379 × 6
##    book                word          n      tf   idf  tf_idf
##    <fct>               <chr>     <int>   <dbl> <dbl>   <dbl>
##  1 Sense & Sensibility elinor      623 0.00519  1.79 0.00931
##  2 Sense & Sensibility marianne    492 0.00410  1.79 0.00735
##  3 Mansfield Park      crawford    493 0.00307  1.79 0.00551
##  4 Pride & Prejudice   darcy       373 0.00305  1.79 0.00547
##  5 Persuasion          elliot      254 0.00304  1.79 0.00544
##  6 Emma                emma        786 0.00488  1.10 0.00536
##  7 Northanger Abbey    tilney      196 0.00252  1.79 0.00452
##  8 Emma                weston      389 0.00242  1.79 0.00433
##  9 Pride & Prejudice   bennet      294 0.00241  1.79 0.00431
## 10 Persuasion          wentworth   191 0.00228  1.79 0.00409
## # … with 40,369 more rows
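To show where the tf, idf, and tf_idf columns come from, here is a minimal base R sketch that computes them by hand on made-up counts. It assumes the usual definitions: tf is a term's count divided by the total count in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. The idf values in the output above are consistent with this for six novels: ln(6/1) ≈ 1.79 and ln(6/2) ≈ 1.10.

```r
# Hypothetical toy counts: one row per (document, term) pair
counts <- data.frame(
  doc  = c("a", "a", "b", "b", "c"),
  term = c("the", "elinor", "the", "darcy", "the"),
  n    = c(8, 2, 9, 1, 10),
  stringsAsFactors = FALSE
)

# tf: term count divided by the total number of terms in the document
doc_totals <- tapply(counts$n, counts$doc, sum)
counts$tf <- counts$n / as.numeric(doc_totals[counts$doc])

# idf: natural log of (number of documents / documents containing the term)
n_docs <- length(unique(counts$doc))
doc_freq <- table(counts$term)
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$term]))

# tf-idf: high for terms frequent in one document but rare across documents
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

A term that appears in every document, such as "the" here, gets an idf of zero and therefore a tf-idf of zero, however frequent it is.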

n-grams

By looking at how often word X is followed by word Y, we can build a model of the relationships between them.

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

austen_bigrams

## # A tibble: 675,025 × 2
##    book                bigram         
##    <fct>               <chr>          
##  1 Sense & Sensibility sense and      
##  2 Sense & Sensibility and sensibility
##  3 Sense & Sensibility <NA>           
##  4 Sense & Sensibility by jane        
##  5 Sense & Sensibility jane austen    
##  6 Sense & Sensibility <NA>           
##  7 Sense & Sensibility <NA>           
##  8 Sense & Sensibility <NA>           
##  9 Sense & Sensibility <NA>           
## 10 Sense & Sensibility <NA>           
## # … with 675,015 more rows
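Counting how often one word follows another takes two more steps: dropping the NA rows, which come from lines too short to form a bigram, and splitting each bigram into its two words. This is a minimal sketch on made-up text; `toy`, `word1`, and `word2` are names chosen for this example.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy text standing in for a larger corpus
toy <- tibble(text = c("the cat sat", "the cat ran", "a dog sat"))

bigram_counts <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                                    # drop rows with no bigram
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%   # one column per word
  count(word1, word2, sort = TRUE)                              # how often word2 follows word1

bigram_counts
```

With the two words in separate columns, stop words can be filtered from either position before counting.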

tidytext tutorial

MiLin

2022-11-03
