Introduction

EICUT-AI

Extracting Informative Cases Based on TextRank - AI

Souce Code

Library

라이브러리가 설치되어 있지 않다면 다음 코드를
실행해서 설치한다.

install.packages("dplyr")
install.packages("tidytext")
install.packages("textrank")

필요한 라이브러리를 다음과 같이 불러온다.

library(dplyr)
library(tidytext)
library(textrank)

함수

textrank로 informative case의 값 계산하기

obj 입력데이터의 형식은 data.frame 혹은 tibble 객체다.
컬럼 정보는 다음과 같다.
* 첫번째 컬럼의 이름 = ID : ID * 두번째 컬럼의 이름 = text : sentence 혹은 title과 같이 문장으로 구분되는 데이터

즉, 테이블의 첫번째 컬럼은 ID, 두번째 컬럼은 text로 이름을 맞춰서 입력한다.
만약 이름이 다르다면 names()함수를 이용해 이름을 수정한다.

dictionary는 계산에 사용할 사전이다. 만약 사전을 입력하지 않으면 주어진
데이터에서 token을 추출하여 사용한다.
관심을 두고 있는 키워드나 단어를 활용하여 textrank 결과를 질적으로
컨트롤 할 수 있다.
사전은 tibble이나 data.frame 형식으로,
첫번째 컬럼은 ID, 두번째 컬럼은 word라는 이름을 가지고 있어야 한다.
즉, 각각의 text에서 어떤 word를 계산할 지를 mapping하는 데이터다.

eicut <- function(obj,dictionary=NULL) {
  if(is.null(dictionary)) {
    dicionary <- obj %>% unnest_tokens(word,text) %>% select(word)
  }
  textrank_sentences(
    data = obj,
    terminology=dictionary
  )
}

informative items 골라오기

eicut()의 결과를 obj로 받고, n= 옵션을 이용해 가장 정보량이 많은 아이템을 추출

find_top_doc <- function(obj,n=10) {
  a=obj[['sentences']] %>% 
    arrange(desc(textrank)) %>% 
    slice(1:n)
  a
}

전체 소스코드는 다음과 같다.

library(dplyr)
library(tidytext)
library(textrank)
eicut <- function(obj,dictionary=NULL) {
  names(obj) = c("ID","text")
  if(is.null(dictionary)) {
    dicionary <- obj %>% unnest_tokens(word,text)
  }
  textrank_sentences(
    data = obj,
    terminology=dictionary
  )
}
find_top_doc <- function(obj,n=10) {
  a=obj[['sentences']] %>% 
    arrange(desc(textrank)) %>% 
    slice(1:n)
  a
}

예제

어린이용 Fitbit tracker의 내용을 분석하는
예제를 작성한다.

article_sentences를 다음과 같이 획득한다.

library(rvest)
url <- "http://time.com/5196761/fitbit-ace-kids-fitness-tracker/"
article <- read_html(url) %>% 
  html_nodes('div[class="padded"]') %>% 
  html_text()
article_sentences <- tibble(
  text=article
) %>% 
  unnest_tokens(sentence,text,token='sentences') %>% 
  mutate(sentence_id=row_number()) %>% 
  select(sentence_id,sentence)

데이터의 이름을 ID 와 text 로 수정한다.
어떤 모습인지 확인하기 위해 처음 데이터 몇 개를 추려보자.

names(article_sentences) <- c("ID","text")
head(article_sentences)

사전을 제작한다.
영어 stop_words를 anti_join()으로 제거한다.
만약 포함시킬 단어만을 추리려면 semi_join()을 활용하여
포함시킬 단어가 있는 data.frame을 지정해야 한다.
이 data.frame 객체는 word라는 컬럼만으로 구성되며
중복값을 가지고 있지 않아야 한다.

article_words <- article_sentences %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words,by='word')

EICUT을 실행한다.

article_summary <- eicut(article_sentences,article_words)

가장 정보량이 많은 10개의 상위 문서를추출한다.

result <- find_top_doc(article_summary)
result

결과를 보면 다음 컬럼으로 구성되어 있다.

이제 result의 결과에서 키워드를 tidytext로 다시 끄집어내보자.
4회 이상 등장한 키워드는 주요 무엇일까?

result %>% 
  select(sentence) %>% 
  unnest_tokens(word,sentence) %>% 
  anti_join(stop_words) %>% 
  count(word) %>% 
  filter(n>=4) %>% 
  arrange(desc(n)) 
Joining with `by = join_by(word)`

이에 따라 현재 문서의 주요 토픽은 fitbit, children, tracker, ace라는 단어로
기술될 수 있음을 알 수 있다.

LS0tDQp0aXRsZTogIkVJQ1VULUFJIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KIyBJbnRyb2R1Y3Rpb24NCiMjIEVJQ1VULUFJDQoNCkV4dHJhY3RpbmcgSW5mb3JtYXRpdmUgQ2FzZXMgQmFzZWQgb24gVGV4dFJhbmsgLSBBSQ0KDQoqIEJ1c2luZXNzIERhdGEgTGFiDQoqIFRhZWt5dW5nIEtpbSwgUGhELiBQcm9mZXNzb3INCiogdGtfa2ltQGtodS5hYy5rcg0KDQojIFNvdWNlIENvZGUNCg0KIyMgTGlicmFyeQ0KDQrrnbzsnbTruIzrn6zrpqzqsIAg7ISk7LmY65CY7Ja0IOyeiOyngCDslYrri6TrqbQg64uk7J2MIOy9lOuTnOulvFwNCuyLpO2Wie2VtOyEnCDshKTsuZjtlZzri6QuDQoNCmBgYHtyfQ0KaW5zdGFsbC5wYWNrYWdlcygiZHBseXIiKQ0KaW5zdGFsbC5wYWNrYWdlcygidGlkeXRleHQiKQ0KaW5zdGFsbC5wYWNrYWdlcygidGV4dHJhbmsiKQ0KYGBgDQoNCu2VhOyalO2VnCDrnbzsnbTruIzrn6zrpqzrpbwg64uk7J2M6rO8IOqwmeydtCDrtojrn6zsmKjri6QuDQoNCmBgYHtyfQ0KbGlicmFyeShkcGx5cikNCmxpYnJhcnkodGlkeXRleHQpDQpsaWJyYXJ5KHRleHRyYW5rKQ0KYGBgDQoNCiMjIO2VqOyImA0KDQojIyMgdGV4dHJhbmvroZwgaW5mb3JtYXRpdmUgY2FzZeydmCDqsJIg6rOE7IKw7ZWY6riwDQoNCm9iaiDsnoXroKXrjbDsnbTthLDsnZgg7ZiV7Iud7J2AIGBkYXRhLmZyYW1lYCDtmLnsnYAgYHRpYmJsZWAg6rCd7LK064ukLlwNCuy7rOufvCDsoJXrs7TripQg64uk7J2M6rO8IOqwmeuLpC5cDQoqIOyyq+uyiOynuCDsu6zrn7zsnZgg7J2066aEID0gSUQgOiBJRA0KKiDrkZDrsojsp7gg7Lus65+87J2YIOydtOumhCA9IHRleHQgOiBzZW50ZW5jZSDtmLnsnYAgdGl0bGXqs7wg6rCZ7J20IOusuOyepeycvOuhnCDqtazrtoTrkJjripQg642w7J207YSwDQoNCuymiSwg7YWM7J2067iU7J2YIOyyq+uyiOynuCDsu6zrn7zsnYAgSUQsIOuRkOuyiOynuCDsu6zrn7zsnYAgdGV4dOuhnCDsnbTrpoTsnYQg66ee7Law7IScIOyeheugpe2VnOuLpC5cDQrrp4zslb0g7J2066aE7J20IOuLpOultOuLpOuptCBuYW1lcygp7ZWo7IiY66W8IOydtOyaqe2VtCDsnbTrpoTsnYQg7IiY7KCV7ZWc64ukLg0KDQpkaWN0aW9uYXJ564qUIOqzhOyCsOyXkCDsgqzsmqntlaAg7IKs7KCE7J2064ukLiDrp4zslb0g7IKs7KCE7J2EIOyeheugpe2VmOyngCDslYrsnLzrqbQg7KO87Ja07KeEXA0K642w7J207YSw7JeQ7IScIHRva2Vu7J2EIOy2lOy2nO2VmOyXrCDsgqzsmqntlZzri6QuXA0K6rSA7Ius7J2EIOuRkOqzoCDsnojripQg7YKk7JuM65Oc64KYIOuLqOyWtOulvCDtmZzsmqntlZjsl6wgYHRleHRyYW5rYCDqsrDqs7zrpbwg7KeI7KCB7Jy866GcXA0K7Luo7Yq466GkIO2VoCDsiJgg7J6I64ukLlwNCuyCrOyghOydgCB0aWJibGXsnbTrgpggZGF0YS5mcmFtZSDtmJXsi53snLzroZwsXA0K7LKr67KI7Ke4IOy7rOufvOydgCBJRCwg65GQ67KI7Ke4IOy7rOufvOydgCB3b3Jk652864qUIOydtOumhOydhCDqsIDsp4Dqs6Ag7J6I7Ja07JW8IO2VnOuLpC4gIA0K7KaJLCDqsIHqsIHsnZggdGV4dOyXkOyEnCDslrTrlqQgd29yZOulvCDqs4TsgrDtlaAg7KeA66W8IG1hcHBpbmftlZjripQg642w7J207YSw64ukLg0KDQpgYGB7cn0NCmVpY3V0IDwtIGZ1bmN0aW9uKG9iaixkaWN0aW9uYXJ5PU5VTEwpIHsNCiAgbmFtZXMob2JqKSA9IGMoIklEIiwidGV4dCIpDQogIGlmKGlzLm51bGwoZGljdGlvbmFyeSkpIHsNCiAgICBkaWNpb25hcnkgPC0gb2JqICU+JSB1bm5lc3RfdG9rZW5zKHdvcmQsdGV4dCkNCiAgfQ0KICB0ZXh0cmFua19zZW50ZW5jZXMoDQogICAgZGF0YSA9IG9iaiwNCiAgICB0ZXJtaW5vbG9neT1kaWN0aW9uYXJ5DQogICkNCn0NCmBgYA0KDQojIyMgaW5mb3JtYXRpdmUgaXRlbXMg6rOo65287Jik6riwDQoNCmBlaWN1dCgpYOydmCDqsrDqs7zrpbwgb2Jq66GcIOuwm+qzoCwgbj0g7Ji17IWY7J2EIOydtOyaqe2VtCDqsIDsnqUg7KCV67O065+J7J20IOunjuydgCDslYTsnbTthZzsnYQg7LaU7LacICANCg0KYGBge3J9DQpmaW5kX3RvcF9kb2MgPC0gZnVuY3Rpb24ob2JqLG49MTApIHsNCiAgYT1vYmpbWydzZW50ZW5jZXMnXV0gJT4lIA0KICAgIGFycmFuZ2UoZGVzYyh0ZXh0cmFuaykpICU+JSANCiAgICBzbGljZSgxOm4pDQogIGENCn0NCmBgYA0KDQoNCuyghOyytCDshozsiqTsvZTrk5zripQg64uk7J2M6rO8IOqwmeuLpC4NCg0KYGBgDQpsaWJyYXJ5KGRwbHlyKQ0KbGlicmFyeSh0aWR5dGV4dCkNCmxpYnJhcnkodGV4dHJhbmspDQplaWN1dCA8LSBmdW5jdGlvbihvYmosZGljdGlvbmFyeT1OVUxMKSB7DQogIG5hbWVzKG9iaikgPSBjKCJJRCIsInRleHQiKQ0KICBpZihpcy5udWxsKGRpY3Rpb25hcnkpKSB7DQogICAgZGljaW9uYXJ5IDwtIG9iaiAlPiUgdW5uZXN0X3Rva2Vucyh3b3JkLHRleHQpDQogIH0NCiAgdGV4dHJhbmtfc2VudGVuY2VzKA0KICAgIGRhdGEgPSBvYmosDQogICAgdGVybWlub2xvZ3k9ZGljdGlvbmFyeQ0KICApDQp9DQpmaW5kX3RvcF9kb2MgPC0gZnVuY3Rpb24ob2JqLG49MTApIHsNCiAgYT1vYmpbWydzZW50ZW5jZXMnXV0gJT4lIA0KICAgIGFycmFuZ2UoZGVzYyh0ZXh0cmFuaykpICU+JSANCiAgICBzbGljZSgxOm4pDQogIGENCn0NCmBgYA0KDQoNCiMg7JiI7KCcDQoNCuyWtOumsOydtOyaqSBGaXRiaXQgdHJhY2tlcuydmCDrgrTsmqnsnYQg67aE7ISd7ZWY64qUICANCuyYiOygnOulvCDsnpHshLHtlZzri6QuDQoNCmFydGljbGVfc2VudGVuY2Vz66W8IOuLpOydjOqzvCDqsJnsnbQg7ZqN65Od7ZWc64ukLiANCmBgYHtyfQ0KbGlicmFyeShydmVzdCkNCnVybCA8LSAiaHR0cDovL3RpbWUuY29tLzUxOTY3NjEvZml0Yml0LWFjZS1raWRzLWZpdG5lc3MtdHJhY2tlci8iDQphcnRpY2xlIDwtIHJlYWRfaHRtbCh1cmwpICU+JSANCiAgaHRtbF9ub2RlcygnZGl2W2NsYXNzPSJwYWRkZWQiXScpICU+JSANCiAgaHRtbF90ZXh0KCkNCmFydGljbGVfc2VudGVuY2VzIDwtIHRpYmJsZSgNCiAgdGV4dD1hcnRpY2xlDQopICU+JSANCiAgdW5uZXN0X3Rva2VucyhzZW50ZW5jZSx0ZXh0LHRva2VuPSdzZW50ZW5jZXMnKSAlPiUgDQogIG11dGF0ZShzZW50ZW5jZV9pZD1yb3dfbnVtYmVyKCkpICU+JSANCiAgc2VsZWN0KHNlbnRlbmNlX2lkLHNlbnRlbmNlKQ0KYGBgDQoNCuuNsOydtO2EsOydmCDsnbTrpoTsnYQgSUQg7JmAIHRleHQg66GcIOyImOygle2VnOuLpC4gIA0K7Ja065akIOuqqOyKteyduOyngCDtmZXsnbjtlZjquLAg7JyE7ZW0IOyymOydjCDrjbDsnbTthLAg66qHIOqwnOulvCDstpTroKTrs7TsnpAuICANCg0KYGBge3J9DQpuYW1lcyhhcnRpY2xlX3NlbnRlbmNlcykgPC0gYygiSUQiLCJ0ZXh0IikNCmhlYWQoYXJ0aWNsZV9zZW50ZW5jZXMpDQpgYGANCuyCrOyghOydhCDsoJzsnpHtlZzri6QuICANCuyYgeyWtCBzdG9wX3dvcmRz66W8IGFudGlfam9pbigp7Jy866GcIOygnOqxsO2VnOuLpC4gIA0K66eM7JW9IO2PrO2VqOyLnO2CrCDri6jslrTrp4zsnYQg7LaU66as66Ck66m0IHNlbWlfam9pbigp7J2EIO2ZnOyaqe2VmOyXrCAgDQrtj6ztlajsi5ztgqwg64uo7Ja06rCAIOyeiOuKlCBkYXRhLmZyYW1l7J2EIOyngOygle2VtOyVvCDtlZzri6QuICANCuydtCBkYXRhLmZyYW1lIOqwneyytOuKlCB3b3Jk652864qUIOy7rOufvOunjOycvOuhnCDqtazshLHrkJjrqbAgIA0K7KSR67O16rCS7J2EIOqwgOyngOqzoCDsnojsp4Ag7JWK7JWE7JW8IO2VnOuLpC4NCg0KYGBge3J9DQphcnRpY2xlX3dvcmRzIDwtIGFydGljbGVfc2VudGVuY2VzICU+JSANCiAgdW5uZXN0X3Rva2Vucyh3b3JkLHRleHQpICU+JSANCiAgYW50aV9qb2luKHN0b3Bfd29yZHMsYnk9J3dvcmQnKQ0KYGBgDQoNCkVJQ1VU7J2EIOyLpO2Wie2VnOuLpC4NCg0KYGBge3J9DQphcnRpY2xlX3N1bW1hcnkgPC0gZWljdXQoYXJ0aWNsZV9zZW50ZW5jZXMsYXJ0aWNsZV93b3JkcykNCmBgYA0KDQrqsIDsnqUg7KCV67O065+J7J20IOunjuydgCAxMOqwnOydmCDsg4HsnIQg66y47ISc66W87LaU7Lac7ZWc64ukLg0KDQpgYGB7cn0NCnJlc3VsdCA8LSBmaW5kX3RvcF9kb2MoYXJ0aWNsZV9zdW1tYXJ5KQ0KcmVzdWx0DQpgYGANCuqysOqzvOulvCDrs7TrqbQg64uk7J2MIOy7rOufvOycvOuhnCDqtazshLHrkJjslrQg7J6I64ukLiANCg0KKiB0ZXh0cmFua19pZA0KKiBzZW50ZW5jZQ0KKiB0ZXh0cmFuaw0KDQrsnbTsoJwgcmVzdWx07J2YIOqysOqzvOyXkOyEnCDtgqTsm4zrk5zrpbwgdGlkeXRleHTroZwg64uk7IucIOuBhOynkeyWtOuCtOuztOyekC4gIA0KNO2ajCDsnbTsg4Eg65Ox7J6l7ZWcIO2CpOybjOuTnOuKlCDso7zsmpQg66y07JeH7J286rmMPw0KDQpgYGB7cn0NCnJlc3VsdCAlPiUgDQogIHNlbGVjdChzZW50ZW5jZSkgJT4lIA0KICB1bm5lc3RfdG9rZW5zKHdvcmQsc2VudGVuY2UpICU+JSANCiAgYW50aV9qb2luKHN0b3Bfd29yZHMpICU+JSANCiAgY291bnQod29yZCkgJT4lIA0KICBmaWx0ZXIobj49NCkgJT4lIA0KICBhcnJhbmdlKGRlc2MobikpIA0KYGBgDQoNCuydtOyXkCDrlLDrnbwg7ZiE7J6sIOusuOyEnOydmCDso7zsmpQg7Yag7ZS97J2AIGZpdGJpdCwgY2hpbGRyZW4sIHRyYWNrZXIsIGFjZeudvOuKlCDri6jslrTroZwgIA0K6riw7Iig65CgIOyImCDsnojsnYzsnYQg7JWMIOyImCDsnojri6QuDQo=