Corpus in R

1. Corpus

Corpus: a body of texts ('말뭉치' in Korean).
In R, a Corpus is a collection of text data in a specific form that holds both Content and Meta.

Inspecting a corpus

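library(tm)
data("crude")
summary("crude") # quoted, so this summarizes the string "crude" itself; use summary(crude) for the corpus

##    Length     Class      Mode 
##         1 character character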

inspect(crude[1])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## $`reut-00001.xml`
## <<PlainTextDocument>>
## Metadata:  15
## Content:  chars: 527

crude[[1]]$content

## [1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"

Metadata

meta(crude[[1]],tag="author")<-"baek"
crude[[1]]$meta

##   author       : baek
##   datetimestamp: 1987-02-26 17:00:56
##   description  : 
##   heading      : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
##   id           : 127
##   language     : en
##   origin       : Reuters-21578 XML
##   topics       : YES
##   lewissplit   : TRAIN
##   cgisplit     : TRAINING-SET
##   oldid        : 5670
##   places       : usa
##   people       : character(0)
##   orgs         : character(0)
##   exchanges    : character(0)
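
The Content/Meta split shown above can also be seen on a tiny hand-made corpus. This is a minimal sketch: the two sentences and the name `toy` are made up for illustration, and `VCorpus(VectorSource(...))` is used here only as a convenient way to build a corpus from a character vector.

```r
library(tm)

# Two made-up documents, used only for illustration.
texts <- c("Oil prices fell today.", "OPEC will meet next week.")
toy <- VCorpus(VectorSource(texts))

# Content: the raw text of the first document.
content(toy[[1]])

# Meta: attach a document-level tag and read it back.
meta(toy[[1]], tag = "author") <- "baek"
meta(toy[[1]], tag = "author")
```

`content()` and `meta()` are the accessor forms of `$content` and `$meta` used above.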

1.1 tm::

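tm is a package for text mining.
It provides the functions for building the matrices needed to process free text as structured data.
Its API is not yet stable, and usage changes as the package keeps being updated.

1.1.1 tm::tm_map function examples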

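tm_map(corpus, content_transformer(tolower)) : convert to lower case
tm_map(corpus, stemDocument) : keep only word stems
tm_map(corpus, stripWhitespace) : collapse extra whitespace
tm_map(corpus, removePunctuation) : remove punctuation
tm_map(corpus, removeNumbers) : remove numbers
tm_map(corpus, removeWords, "word") : remove the given word
tm_map(corpus, removeWords, stopwords("english")) : remove stop words
tm_map(corpus, PlainTextDocument) : convert to a TextDocument
Some functions break the corpus structure, for example a bare tolower, which returns plain character vectors instead of documents.
Applying PlainTextDocument afterwards restores the corpus structure (content, meta).
Note, however, that all meta information is lost.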
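
As a rough illustration of how these transformations chain together, the sketch below runs several of them on one made-up sentence. The sentence and the name `toy` are assumptions for the example; `stemDocument` is left out because it needs the additional SnowballC package.

```r
library(tm)

# Made-up sentence for illustration only.
toy <- VCorpus(VectorSource("The 3 Companies CUT prices,   again!"))

toy <- tm_map(toy, content_transformer(tolower))       # lower-case while keeping document objects
toy <- tm_map(toy, removePunctuation)                  # drop punctuation marks
toy <- tm_map(toy, removeNumbers)                      # drop digits
toy <- tm_map(toy, removeWords, stopwords("english"))  # drop stop words such as "the"
toy <- tm_map(toy, stripWhitespace)                    # collapse runs of whitespace

cleaned <- content(toy[[1]])
cleaned
```

Wrapping `tolower` in `content_transformer()` keeps each element a document, so no later repair with `PlainTextDocument` should be needed and the metadata is kept.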

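1.2 Document matrix

After preprocessing, a text or corpus needs to be converted into matrix form.
The matrix form is used in particular when there are several texts to analyze, or when a single document is analyzed line by line,
and term frequencies need to be counted.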

1.2.1 tm::DocumentTermMatrix()

Creating a DTM

dtm <- DocumentTermMatrix(crude)
inspect(dtm) # summary

## <<DocumentTermMatrix (documents: 20, terms: 1266)>>
## Non-/sparse entries: 2255/23065
## Sparsity           : 91%
## Maximal term length: 17
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  and for its mln oil opec prices said that the
##   144   9   5   6   4  11   10      3    9   10  17
##   236   7   4   8   4   7    6      2    6    4  15
##   237  11   3   3   1   3    1      0    0    1  30
##   242   3   1   0   0   3    2      1    3    0   6
##   246   9   6   3   0   4    1      0    4    2  18
##   248   6   2   2   3   9    6      7    5    2  27
##   273   5   4   0   9   5    5      4    5    0  21
##   489   5   4   2   2   4    0      2    2    1   8
##   502   6   5   2   2   4    0      2    2    1  13
##   704   5   3   1   0   3    0      2    3    3  21

inspect(dtm[1:10,1:5]) # terms 1-5 of documents 1-10

## <<DocumentTermMatrix (documents: 10, terms: 5)>>
## Non-/sparse entries: 4/46
## Sparsity           : 92%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##      Terms
## Docs  "(it) "demand "expansion "for "growth
##   127     0       0          0    0       0
##   144     0       1          0    0       0
##   191     0       0          0    0       0
##   194     0       0          0    0       0
##   211     0       0          0    0       0
##   236     0       0          0    0       0
##   237     1       0          0    1       1
##   242     0       0          0    0       0
##   246     0       0          0    0       0
##   248     0       0          0    0       0

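# View(t(as.matrix(dtm))) # view the full matrix

Weighting schemes

The default is weightTf.
Tf : the raw frequency of a term in a document.
TfIdf : idf assigns a high value to terms that occur in few documents across the whole collection,
so terms that are characteristic of a particular document score high.
Multiplying this by Tf, which reflects how prominent the term is within the document, gives high values to terms that occur often within a specific topic.
Importance within the document (TF) * specificity to the topic (IDF) = TFIDF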

dtm2 <- DocumentTermMatrix(crude,control=list(weighting=weightTfIdf))
inspect(dtm2[1:10,1:5])

## <<DocumentTermMatrix (documents: 10, terms: 5)>>
## Non-/sparse entries: 4/46
## Sparsity           : 92%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##      Terms
## Docs       "(it)    "demand "expansion       "for    "growth
##   127 0.00000000 0.00000000          0 0.00000000 0.00000000
##   144 0.00000000 0.01180855          0 0.00000000 0.00000000
##   191 0.00000000 0.00000000          0 0.00000000 0.00000000
##   194 0.00000000 0.00000000          0 0.00000000 0.00000000
##   211 0.00000000 0.00000000          0 0.00000000 0.00000000
##   236 0.00000000 0.00000000          0 0.00000000 0.00000000
##   237 0.01214025 0.00000000          0 0.01214025 0.01214025
##   242 0.00000000 0.00000000          0 0.00000000 0.00000000
##   246 0.00000000 0.00000000          0 0.00000000 0.00000000
##   248 0.00000000 0.00000000          0 0.00000000 0.00000000
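
To make the TF * IDF product concrete, here is a hand computation in base R on a made-up count matrix. It assumes a length-normalized TF and a base-2 logarithm for IDF, which is my understanding of what the "(normalized)" weighting above does; treat the exact correspondence with weightTfIdf as an assumption.

```r
# Made-up term counts: rows are documents, columns are terms.
counts <- matrix(c(2, 1, 0,
                   1, 0, 3,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("oil", "opec", "winter")))

tf    <- counts / rowSums(counts)                  # term frequency, normalized by document length
idf   <- log2(nrow(counts) / colSums(counts > 0))  # terms found in fewer documents get a larger weight
tfidf <- sweep(tf, 2, idf, `*`)                    # multiply each term column by its IDF

round(tfidf, 3)
```

Because "oil" occurs in every document, its IDF is log2(3/3) = 0 and its whole column becomes zero, while "opec" and "winter", which each occur in one document only, keep non-zero weights.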

Listing all terms with colnames()

head(colnames(dtm2),20)

##  [1] "\"(it)"      "\"demand"    "\"expansion" "\"for"       "\"growth"   
##  [6] "\"if"        "\"is"        "\"may"       "\"none"      "\"opec"     
## [11] "\"opec's"    "\"our"       "\"the"       "\"there"     "\"they"     
## [16] "\"this"      "\"we"        "\"will"      "(bpd)"       "(bpd)."

1.2.2 findFreqTerms(dtm,lowfreq)

Finding frequently occurring terms

dtm <- DocumentTermMatrix(crude)
findFreqTerms(dtm, lowfreq = 10) # at least 10 occurrences; highfreq (maximum occurrences) can also be set

##  [1] "about"      "and"        "are"        "bpd"        "but"       
##  [6] "crude"      "dlrs"       "for"        "from"       "government"
## [11] "has"        "its"        "kuwait"     "last"       "market"    
## [16] "mln"        "new"        "not"        "official"   "oil"       
## [21] "one"        "opec"       "pct"        "price"      "prices"    
## [26] "reuter"     "said"       "said."      "saudi"      "sheikh"    
## [31] "that"       "the"        "they"       "u.s."       "was"       
## [36] "were"       "will"       "with"       "would"
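
Conceptually, this amounts to summing each term's count over all documents and keeping the terms whose total reaches the threshold. The base R sketch below shows that idea on a made-up count matrix; it illustrates the logic and is not the tm implementation.

```r
# Made-up term counts: rows are documents, columns are terms.
counts <- matrix(c(5, 0, 1,
                   6, 2, 0,
                   4, 1, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"),
                                 c("oil", "opec", "winter")))

totals   <- colSums(counts)              # total occurrences of each term across all documents
lowfreq  <- 10                           # same role as the lowfreq argument above
frequent <- names(totals)[totals >= lowfreq]
frequent
```

Here only "oil" (15 occurrences in total) passes the threshold of 10.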

1.2.3 findAssocs(dtm,"word",0.5)

Shows the terms whose correlation with the given term is at least the specified threshold.

dtm <- DocumentTermMatrix(crude)
findAssocs(dtm,"oil",0.7)

## $oil
##      15.8      opec   clearly      late    trying       who    winter 
##      0.87      0.87      0.80      0.80      0.80      0.80      0.80 
##  analysts      said   meeting     above emergency    market     fixed 
##      0.79      0.78      0.77      0.76      0.75      0.75      0.73 
##      that    prices agreement    buyers 
##      0.73      0.72      0.71      0.70
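
The association scores can be read as correlations between term columns of the DTM: two terms score high when their counts rise and fall together across documents. A minimal base R sketch on made-up counts, assuming the Pearson correlation:

```r
# Made-up term counts: rows are documents, columns are terms.
counts <- matrix(c(4, 3, 0,
                   1, 1, 2,
                   0, 0, 1,
                   3, 2, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("d", 1:4),
                                 c("oil", "opec", "winter")))

target <- "oil"
others <- setdiff(colnames(counts), target)

# Correlate the target term's column with every other term's column.
assoc <- cor(counts[, target, drop = FALSE],
             counts[, others, drop = FALSE])[1, ]

# Keep only associations at or above the threshold, strongest first.
strong <- sort(assoc[assoc >= 0.7], decreasing = TRUE)
round(strong, 2)
```

In this toy data "opec" moves with "oil" and passes the 0.7 cut-off, while "winter" is negatively correlated and is dropped.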

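1.3 Creating a corpus

1.3.1 Creating a corpus from a file

# read.any() is not part of base R or tm; it presumably comes from an add-on package such as makeDTM (see 1.3.4), which must be loaded first
docs <- read.any("data/diarytest.csv", header = TRUE)
docs.df <- data.frame(doc_id = docs$id, text = docs$body) # reshape into the doc_id/text layout that DataframeSource expects
docs.ds <- DataframeSource(docs.df)
docs.cp <- Corpus(docs.ds)
inspect(docs.cp)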

1.3.2 Converting to a DTM

docs.dtm <- DocumentTermMatrix(docs.cp)
inspect(docs.dtm)

1.3.3 Converting to a DataFrame with a label

docs.df <- cbind(as.data.frame(as.matrix(docs.dtm)),LABEL=rep("diary",length(docs.cp)))
docs.df
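
Since data/diarytest.csv is not included here, the following sketch replays steps 1.3.1 to 1.3.3 on a made-up two-row data frame. The texts and the name `labeled` are assumptions for illustration; the only requirement carried over from above is that the data frame has the columns doc_id and text.

```r
library(tm)

# Made-up stand-in for the CSV file; DataframeSource expects the columns doc_id and text.
toy.df <- data.frame(doc_id = c("1", "2"),
                     text   = c("excel crashed again", "computer runs excel"),
                     stringsAsFactors = FALSE)

toy.cp  <- VCorpus(DataframeSource(toy.df))  # data frame -> corpus
toy.dtm <- DocumentTermMatrix(toy.cp)        # corpus -> document-term matrix

# Dense matrix -> data frame, with one constant label column appended.
labeled <- cbind(as.data.frame(as.matrix(toy.dtm)),
                 LABEL = rep("diary", length(toy.cp)))
labeled
```

Each row is one document, each term is a column of counts, and LABEL can later serve as the class variable for a classifier.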

1.3.4 makeDTM::

Installing the package

library(devtools)
install_github("SukjaeChoi/makeDTM")
library(makeDTM)

Preparing the data

docs <- read.any("data/diarytest.csv", header = TRUE)

Creating the DTM

mydtm <- makeDTM(docs,TEXT.name = "body")

Keyword feature

keyword <- c("엑셀을","컴퓨터가")
mydtm <- makeDTM(docs,key=keyword,TEXT.name = "body")

Attaching labels

mydtm <- makeDTM(docs,key=keyword,TEXT.name = "body",LABEL = T,LABEL.name = "tag")

Changing the weighting

makeDTM(docs,key=keyword,weight="tfidf",LABEL = T,LABEL.name = "tag",TEXT.name = "body")

Morphological analysis

makeDTM(docs,key=keyword,LABEL = T,LABEL.name = "tag",TEXT.name = "body",RHINO = T)

Baek Kwang-ryeol (백광렬), Department of Statistics, Konkuk University - 2018 Big Data Young Talent (빅데이터 청년인재) program

August 2, 2018 (Day 24)