1. Corpus

'Corpus': a body of text
In R, a Corpus is a collection of text data in a specific format, where each document has Content and Meta.

Inspecting a corpus

library(tm)

## Warning: package 'tm' was built under R version 3.5.1

## Loading required package: NLP

data("crude")
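summary("crude") # the quotes pass the string "crude", so this summarizes a character vector; use summary(crude) to summarize the corpus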

##    Length     Class      Mode 
##         1 character character

inspect(crude[1])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## $`reut-00001.xml`
## <<PlainTextDocument>>
## Metadata:  15
## Content:  chars: 527

crude[[1]]$content

## [1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"

Metadata

meta(crude[[1]],tag="author")<-"baek"
crude[[1]]$meta

##   author       : baek
##   datetimestamp: 1987-02-26 17:00:56
##   description  : 
##   heading      : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
##   id           : 127
##   language     : en
##   origin       : Reuters-21578 XML
##   topics       : YES
##   lewissplit   : TRAIN
##   cgisplit     : TRAINING-SET
##   oldid        : 5670
##   places       : usa
##   people       : character(0)
##   orgs         : character(0)
##   exchanges    : character(0)

1.1 tm::

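tm is a text-mining package.
It provides functions for building the matrices needed to process plain text as structured data.
It is not yet stable, and its usage changes as the package is continually upgraded.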

1.1.1 tm::tm_map function examples

tm_map(corpus, content_transformer(tolower)) : convert to lowercase
tm_map(corpus, stemDocument) : keep only word stems
tm_map(corpus, stripWhitespace) : remove extra whitespace
tm_map(corpus, removePunctuation) : remove punctuation
tm_map(corpus, removeNumbers) : remove numbers
tm_map(corpus, removeWords, "word") : remove the given word
tm_map(corpus, removeWords, stopwords("english")) : remove stop words
tm_map(corpus, PlainTextDocument) : convert to a TextDocument. Some functions break the corpus structure; applying this function afterwards restores the (content, meta) structure, but all meta information is deleted.
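
The transformations listed above are usually chained. The following is a minimal sketch, assuming a recent tm version in which base functions such as tolower have to be wrapped in content_transformer(); the sample sentence is made up for illustration.

```r
library(tm)

# Hypothetical one-document corpus built from an in-memory character vector.
texts <- c("Oil prices FELL 1.50 dlrs, the company said.")
corpus <- VCorpus(VectorSource(texts))

# Apply the cleaning steps one after another.
corpus <- tm_map(corpus, content_transformer(tolower))      # lowercase
corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                     # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                   # collapse repeated spaces

cleaned <- content(corpus[[1]])
print(cleaned)
```

Wrapping tolower in content_transformer() keeps each document a PlainTextDocument, so the metadata survives and no PlainTextDocument repair step is needed afterwards.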

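1.2 Document matrix

After preprocessing, texts and corpora need to be converted to matrix form. A matrix is used in particular when there are several texts to analyze, or when a single document is analyzed line by line, and term frequencies have to be counted.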

1.2.1 tm::DocumentTermMatrix()

Creating a DTM

dtm <- DocumentTermMatrix(crude)
inspect(dtm) # summary

## <<DocumentTermMatrix (documents: 20, terms: 1266)>>
## Non-/sparse entries: 2255/23065
## Sparsity           : 91%
## Maximal term length: 17
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  and for its mln oil opec prices said that the
##   144   9   5   6   4  11   10      3    9   10  17
##   236   7   4   8   4   7    6      2    6    4  15
##   237  11   3   3   1   3    1      0    0    1  30
##   242   3   1   0   0   3    2      1    3    0   6
##   246   9   6   3   0   4    1      0    4    2  18
##   248   6   2   2   3   9    6      7    5    2  27
##   273   5   4   0   9   5    5      4    5    0  21
##   489   5   4   2   2   4    0      2    2    1   8
##   502   6   5   2   2   4    0      2    2    1  13
##   704   5   3   1   0   3    0      2    3    3  21

inspect(dtm[1:10,1:5]) # terms 1-5 of documents 1-10

## <<DocumentTermMatrix (documents: 10, terms: 5)>>
## Non-/sparse entries: 4/46
## Sparsity           : 92%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  "(it) "demand "expansion "for "growth
##   127     0       0          0    0       0
##   144     0       1          0    0       0
##   191     0       0          0    0       0
##   194     0       0          0    0       0
##   211     0       0          0    0       0
##   236     0       0          0    0       0
##   237     1       0          0    1       1
##   242     0       0          0    0       0
##   246     0       0          0    0       0
##   248     0       0          0    0       0

# View(t(as.matrix(dtm))) # view the whole matrix

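Weighting methods

The default is weightTf.
Tf : the plain frequency of a term in a document
TfIdf : IDF gives a high value to terms that appear in only a few documents of the whole collection. Multiplying it by TF, which measures how often a term appears within a document, gives a high score to terms that are frequent in a particular document but rare elsewhere, that is, terms characteristic of a specific topic. General importance (TF) * importance for a specific topic (IDF) = TFIDF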

dtm2 <- DocumentTermMatrix(crude,control=list(weighting=weightTfIdf))
inspect(dtm2[1:10,1:5])

## <<DocumentTermMatrix (documents: 10, terms: 5)>>
## Non-/sparse entries: 4/46
## Sparsity           : 92%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##      Terms
## Docs       "(it)    "demand "expansion       "for    "growth
##   127 0.00000000 0.00000000          0 0.00000000 0.00000000
##   144 0.00000000 0.01180855          0 0.00000000 0.00000000
##   191 0.00000000 0.00000000          0 0.00000000 0.00000000
##   194 0.00000000 0.00000000          0 0.00000000 0.00000000
##   211 0.00000000 0.00000000          0 0.00000000 0.00000000
##   236 0.00000000 0.00000000          0 0.00000000 0.00000000
##   237 0.01214025 0.00000000          0 0.01214025 0.01214025
##   242 0.00000000 0.00000000          0 0.00000000 0.00000000
##   246 0.00000000 0.00000000          0 0.00000000 0.00000000
##   248 0.00000000 0.00000000          0 0.00000000 0.00000000
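
The TF * IDF product can be reproduced by hand on a small count matrix. This is a sketch in base R with made-up counts; it assumes the conventions that tm documents for weightTfIdf with normalize = TRUE (term counts divided by document length, and IDF computed as log2 of the number of documents over the document frequency).

```r
# Toy term-frequency matrix: 3 documents (rows) x 3 terms (columns).
counts <- matrix(
  c(2, 1, 0,
    1, 0, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "price", "opec"))
)

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)       # number of documents containing each term
idf      <- log2(n_docs / doc_freq)   # rarer terms get a larger IDF
tf       <- counts / rowSums(counts)  # counts normalized by document length
tfidf    <- sweep(tf, 2, idf, `*`)    # multiply each term column by its IDF

print(round(tfidf, 3))
```

Here "oil" occurs in every document, so its IDF is log2(3/3) = 0 and its TF-IDF is 0 everywhere, however often it appears.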

Listing all terms with colnames()

head(colnames(dtm2),20)

##  [1] "\"(it)"      "\"demand"    "\"expansion" "\"for"       "\"growth"   
##  [6] "\"if"        "\"is"        "\"may"       "\"none"      "\"opec"     
## [11] "\"opec's"    "\"our"       "\"the"       "\"there"     "\"they"     
## [16] "\"this"      "\"we"        "\"will"      "(bpd)"       "(bpd)."

1.2.2 findFreqTerms(dtm,lowfreq)

Finding frequently occurring terms

dtm <- DocumentTermMatrix(crude)
findFreqTerms(dtm,lowfreq = 10) # at least 10 occurrences; highfreq (maximum number of occurrences) can also be set

##  [1] "about"      "and"        "are"        "bpd"        "but"       
##  [6] "crude"      "dlrs"       "for"        "from"       "government"
## [11] "has"        "its"        "kuwait"     "last"       "market"    
## [16] "mln"        "new"        "not"        "official"   "oil"       
## [21] "one"        "opec"       "pct"        "price"      "prices"    
## [26] "reuter"     "said"       "said."      "saudi"      "sheikh"    
## [31] "that"       "the"        "they"       "u.s."       "was"       
## [36] "were"       "will"       "with"       "would"
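
The frequency filter can be sketched in base R on a toy matrix. The counts are made up, and the sketch assumes that lowfreq and highfreq bound each term's total count over all documents.

```r
# Toy document-term counts: 3 documents (rows) x 3 terms (columns).
counts <- matrix(
  c(12, 0, 3,
     4, 1, 0,
     9, 2, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "winter", "opec"))
)

term_totals <- colSums(counts)  # total occurrences of each term across documents

# Terms occurring at least 10 times (the lowfreq bound).
frequent <- names(term_totals)[term_totals >= 10]

# Terms occurring between 10 and 20 times (lowfreq and highfreq bounds together).
bounded <- names(term_totals)[term_totals >= 10 & term_totals <= 20]

print(frequent)
print(bounded)
```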

1.2.3 findAssocs(dtm,"word",0.5)

Shows the terms whose correlation with the given term is at least the specified limit.

dtm <- DocumentTermMatrix(crude)
findAssocs(dtm,"oil",0.7)

## $oil
##      15.8      opec   clearly      late    trying       who    winter 
##      0.87      0.87      0.80      0.80      0.80      0.80      0.80 
##  analysts      said   meeting     above emergency    market     fixed 
##      0.79      0.78      0.77      0.76      0.75      0.75      0.73 
##      that    prices agreement    buyers 
##      0.73      0.72      0.71      0.70
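
The association score can be sketched in base R. The counts are made up, and the sketch assumes that the reported value is the Pearson correlation between two term columns of the document-term matrix.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns).
counts <- matrix(
  c(3, 2, 0,
    1, 1, 2,
    0, 0, 4,
    2, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(Docs = paste0("d", 1:4),
                  Terms = c("oil", "opec", "winter"))
)

target <- "oil"
others <- setdiff(colnames(counts), target)

# Correlate the target term's column with every other term's column.
assoc <- sapply(others, function(term) cor(counts[, target], counts[, term]))

# Keep terms at or above the correlation limit, strongest first.
assoc <- sort(round(assoc[assoc >= 0.7], 2), decreasing = TRUE)
print(assoc)
```

Only "opec" passes the 0.7 limit here: its counts rise and fall with those of "oil", while "winter" moves in the opposite direction.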

1.3 Creating a corpus

1.3.1 Creating a corpus from a file

docs <- read.csv("diarytest.csv",header=TRUE)
docs.df <- data.frame(doc_id=docs$id,text=docs$body) # reshape into the doc_id/text data frame that DataframeSource expects
docs.ds <- DataframeSource(docs.df) 
docs.cp <- Corpus(docs.ds)
inspect(docs.cp)

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 4
## 
##                                                              1 
##              오늘은 엑셀을 사용한 첫 날이다. 엑셀을 참 편하다. 
##                                                              2 
## 컴퓨터가 고장났다!!! 컴퓨터가 고장나면 나는 엑셀을 할 수 없다. 
##                                                              3 
##                           1년만에 일기를 다시 쓴다. 방가방가~~ 
##                                                              4 
##                 엑셀로 쓰는 일기는 우리의 생활을 풍요롭게 한다
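
The same steps can be tried without the CSV file. A minimal sketch, assuming an inline data frame as a hypothetical stand-in for diarytest.csv, and using VCorpus explicitly rather than Corpus:

```r
library(tm)

# Hypothetical stand-in for diarytest.csv. DataframeSource expects the first
# two columns to be named doc_id and text.
docs.df <- data.frame(
  doc_id = c("1", "2"),
  text = c("Today was my first day using Excel.",
           "The computer broke down today."),
  stringsAsFactors = FALSE
)

docs.cp  <- VCorpus(DataframeSource(docs.df))  # one document per data frame row
docs.dtm <- DocumentTermMatrix(docs.cp)        # rows are documents, columns are terms

print(length(docs.cp))
print(dim(docs.dtm))
```

Any further columns in the data frame are treated as document-level metadata.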

1.3.2 Converting to a DTM

docs.dtm <- DocumentTermMatrix(docs.cp)
inspect(docs.dtm)

## <<DocumentTermMatrix (documents: 4, terms: 22)>>
## Non-/sparse entries: 23/65
## Sparsity           : 74%
## Maximal term length: 4
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs 고장나면 고장났다 나는 날이다 사용한 없다 엑셀을 오늘은 컴퓨터가
##    1        0        0    0      1      1    0      2      1        0
##    2        1        1    1      0      0    1      1      0        2
##    3        0        0    0      0      0    0      0      0        0
##    4        0        0    0      0      0    0      0      0        0
##     Terms
## Docs 편하다
##    1      1
##    2      0
##    3      0
##    4      0

1.3.3 Converting to a data frame with a label

docs.df <- cbind(as.data.frame(as.matrix(docs.dtm)),LABEL=rep("diary",length(docs.cp)))
docs.df

##   날이다 사용한 엑셀을 오늘은 편하다 고장나면 고장났다 나는 없다 컴퓨터가
## 1      1      1      2      1      1        0        0    0    0        0
## 2      0      0      1      0      0        1        1    1    1        2
## 3      0      0      0      0      0        0        0    0    0        0
## 4      0      0      0      0      0        0        0    0    0        0
##   1년만에 다시 방가방가 쓴다 일기를 생활을 쓰는 엑셀로 우리의 일기는
## 1       0    0        0    0      0      0    0      0      0      0
## 2       0    0        0    0      0      0    0      0      0      0
## 3       1    1        1    1      1      0    0      0      0      0
## 4       0    0        0    0      0      1    1      1      1      1
##   풍요롭게 한다 LABEL
## 1        0    0 diary
## 2        0    0 diary
## 3        0    0 diary
## 4        1    1 diary

1.3.4 makeDTM::

Installing the package

library(devtools)
install_github("SukjaeChoi/makeDTM")

## Skipping install of 'makeDTM' from a github remote, the SHA1 (a545b511) has not changed since last install.
##   Use `force = TRUE` to force installation

library(makeDTM)

## 
## Attaching package: 'makeDTM'

## The following objects are masked from 'package:tm':
## 
##     findAssocs, findFreqTerms

Preparing the data

docs <- read.csv("diarytest.csv",header=TRUE)

Creating the DTM

mydtm <- makeDTM(docs,TEXT.name = "body")

Restricting the DTM to keywords

keyword <- c("엑셀을","컴퓨터가")
mydtm <- makeDTM(docs,key=keyword,TEXT.name = "body")

Attaching labels

mydtm <- makeDTM(docs,key=keyword,TEXT.name = "body",LABEL = T,LABEL.name = "tag")

Changing the weighting

makeDTM(docs,key=keyword,weight="tfidf",LABEL = T,LABEL.name = "tag",TEXT.name = "body")

##      엑셀을 컴퓨터가  LABEL
## 1 1.3862944 0.000000 diary1
## 2 0.6931472 2.772589 diary1
## 3 0.0000000 0.000000 diary2
## 4 0.0000000 0.000000 diary2

Morphological analysis

library(RHINO)
makeDTM(docs,key=keyword,LABEL = T,LABEL.name = "tag",TEXT.name = "body",RHINO = T)

Corpus in R

장성환

2018 8 2
