Introduction
EICUT-AI
Extracting Informative Cases Based on TextRank - AI
Souce Code
Library
라이브러리가 설치되어 있지 않다면 다음 코드를
실행해서 설치한다.
install.packages("dplyr")
install.packages("tidytext")
install.packages("textrank")
필요한 라이브러리를 다음과 같이 불러온다.
library(dplyr)
library(tidytext)
library(textrank)
예제
어린이용 Fitbit tracker의 내용을 분석하는
예제를 작성한다.
article_sentences를 다음과 같이 획득한다.
library(rvest)
url <- "http://time.com/5196761/fitbit-ace-kids-fitness-tracker/"
article <- read_html(url) %>%
html_nodes('div[class="padded"]') %>%
html_text()
article_sentences <- tibble(
text=article
) %>%
unnest_tokens(sentence,text,token='sentences') %>%
mutate(sentence_id=row_number()) %>%
select(sentence_id,sentence)
데이터의 이름을 ID 와 text 로 수정한다.
어떤 모습인지 확인하기 위해 처음 데이터 몇 개를 추려보자.
names(article_sentences) <- c("ID","text")
head(article_sentences)
사전을 제작한다.
영어 stop_words를 anti_join()으로 제거한다.
만약 포함시킬 단어만을 추리려면 semi_join()을 활용하여
포함시킬 단어가 있는 data.frame을 지정해야 한다.
이 data.frame 객체는 word라는 컬럼만으로 구성되며
중복값을 가지고 있지 않아야 한다.
article_words <- article_sentences %>%
unnest_tokens(word,text) %>%
anti_join(stop_words,by='word')
EICUT을 실행한다.
article_summary <- eicut(article_sentences,article_words)
가장 정보량이 많은 10개의 상위 문서를추출한다.
result <- find_top_doc(article_summary)
result
결과를 보면 다음 컬럼으로 구성되어 있다.
- textrank_id
- sentence
- textrank
이제 result의 결과에서 키워드를 tidytext로 다시 끄집어내보자.
4회 이상 등장한 키워드는 주요 무엇일까?
result %>%
select(sentence) %>%
unnest_tokens(word,sentence) %>%
anti_join(stop_words) %>%
count(word) %>%
filter(n>=4) %>%
arrange(desc(n))
Joining with `by = join_by(word)`
이에 따라 현재 문서의 주요 토픽은 fitbit, children, tracker, ace라는
단어로
기술될 수 있음을 알 수 있다.
LS0tDQp0aXRsZTogIkVJQ1VULUFJIg0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KIyBJbnRyb2R1Y3Rpb24NCiMjIEVJQ1VULUFJDQoNCkV4dHJhY3RpbmcgSW5mb3JtYXRpdmUgQ2FzZXMgQmFzZWQgb24gVGV4dFJhbmsgLSBBSQ0KDQoqIEJ1c2luZXNzIERhdGEgTGFiDQoqIFRhZWt5dW5nIEtpbSwgUGhELiBQcm9mZXNzb3INCiogdGtfa2ltQGtodS5hYy5rcg0KDQojIFNvdWNlIENvZGUNCg0KIyMgTGlicmFyeQ0KDQrrnbzsnbTruIzrn6zrpqzqsIAg7ISk7LmY65CY7Ja0IOyeiOyngCDslYrri6TrqbQg64uk7J2MIOy9lOuTnOulvFwNCuyLpO2Wie2VtOyEnCDshKTsuZjtlZzri6QuDQoNCmBgYHtyfQ0KaW5zdGFsbC5wYWNrYWdlcygiZHBseXIiKQ0KaW5zdGFsbC5wYWNrYWdlcygidGlkeXRleHQiKQ0KaW5zdGFsbC5wYWNrYWdlcygidGV4dHJhbmsiKQ0KYGBgDQoNCu2VhOyalO2VnCDrnbzsnbTruIzrn6zrpqzrpbwg64uk7J2M6rO8IOqwmeydtCDrtojrn6zsmKjri6QuDQoNCmBgYHtyfQ0KbGlicmFyeShkcGx5cikNCmxpYnJhcnkodGlkeXRleHQpDQpsaWJyYXJ5KHRleHRyYW5rKQ0KYGBgDQoNCiMjIO2VqOyImA0KDQojIyMgdGV4dHJhbmvroZwgaW5mb3JtYXRpdmUgY2FzZeydmCDqsJIg6rOE7IKw7ZWY6riwDQoNCm9iaiDsnoXroKXrjbDsnbTthLDsnZgg7ZiV7Iud7J2AIGBkYXRhLmZyYW1lYCDtmLnsnYAgYHRpYmJsZWAg6rCd7LK064ukLlwNCuy7rOufvCDsoJXrs7TripQg64uk7J2M6rO8IOqwmeuLpC5cDQoqIOyyq+uyiOynuCDsu6zrn7zsnZgg7J2066aEID0gSUQgOiBJRA0KKiDrkZDrsojsp7gg7Lus65+87J2YIOydtOumhCA9IHRleHQgOiBzZW50ZW5jZSDtmLnsnYAgdGl0bGXqs7wg6rCZ7J20IOusuOyepeycvOuhnCDqtazrtoTrkJjripQg642w7J207YSwDQoNCuymiSwg7YWM7J2067iU7J2YIOyyq+uyiOynuCDsu6zrn7zsnYAgSUQsIOuRkOuyiOynuCDsu6zrn7zsnYAgdGV4dOuhnCDsnbTrpoTsnYQg66ee7Law7IScIOyeheugpe2VnOuLpC5cDQrrp4zslb0g7J2066aE7J20IOuLpOultOuLpOuptCBuYW1lcygp7ZWo7IiY66W8IOydtOyaqe2VtCDsnbTrpoTsnYQg7IiY7KCV7ZWc64ukLg0KDQpkaWN0aW9uYXJ564qUIOqzhOyCsOyXkCDsgqzsmqntlaAg7IKs7KCE7J2064ukLiDrp4zslb0g7IKs7KCE7J2EIOyeheugpe2VmOyngCDslYrsnLzrqbQg7KO87Ja07KeEXA0K642w7J207YSw7JeQ7IScIHRva2Vu7J2EIOy2lOy2nO2VmOyXrCDsgqzsmqntlZzri6QuXA0K6rSA7Ius7J2EIOuRkOqzoCDsnojripQg7YKk7JuM65Oc64KYIOuLqOyWtOulvCDtmZzsmqntlZjsl6wgYHRleHRyYW5rYCDqsrDqs7zrpbwg7KeI7KCB7Jy866GcXA0K7Luo7Yq466GkIO2VoCDsiJgg7J6I64ukLlwNCuyCrOyghOydgCB0aWJibGXsnbTrgpggZGF0YS5mcmFtZSDtmJXsi53snLzroZwsXA0K7LKr67KI7Ke4IOy7rOufvOydgCBJRCwg65GQ67KI7Ke4IOy7rOufvOydgCB3b3Jk652864qUIOydtOumhOydhCDqsIDsp4Dqs6Ag7J6I7Ja07JW8IO2VnOuLpC4gIA0K7KaJLCDqsIHqsIHsnZggdGV4dOyXkOyEnCDslrTrlqQgd29yZOulvCDqs4TsgrDtlaAg7KeA66W8IG1hcHBpbmftlZjripQg642w7J207YSw64ukLg0KDQpgYGB7cn0NCmVpY3V0IDwtIGZ1bmN0aW9uKG9iaixkaWN0aW9uYXJ5PU5VTEwpIHsNCiAgbmFtZXMob2JqKSA9IGMoIklEIiwidGV4dCIpDQogIGlmKGlzLm51bGwoZGljdGlvbmFyeSkpIHsNCiAgICBkaWNpb25hcnkgPC0gb2JqICU+JSB1bm5lc3RfdG9rZW5zKHdvcmQsdGV4dCkNCiAgfQ0KICB0ZXh0cmFua19zZW50ZW5jZXMoDQogICAgZGF0YSA9IG9iaiwNCiAgICB0ZXJtaW5vbG9neT1kaWN0aW9uYXJ5DQogICkNCn0NCmBgYA0KDQojIyMgaW5mb3JtYXRpdmUgaXRlbXMg6rOo65287Jik6riwDQoNCmBlaWN1dCgpYOydmCDqsrDqs7zrpbwgb2Jq66GcIOuwm+qzoCwgbj0g7Ji17IWY7J2EIOydtOyaqe2VtCDqsIDsnqUg7KCV67O065+J7J20IOunjuydgCDslYTsnbTthZzsnYQg7LaU7LacICANCg0KYGBge3J9DQpmaW5kX3RvcF9kb2MgPC0gZnVuY3Rpb24ob2JqLG49MTApIHsNCiAgYT1vYmpbWydzZW50ZW5jZXMnXV0gJT4lIA0KICAgIGFycmFuZ2UoZGVzYyh0ZXh0cmFuaykpICU+JSANCiAgICBzbGljZSgxOm4pDQogIGENCn0NCmBgYA0KDQoNCuyghOyytCDshozsiqTsvZTrk5zripQg64uk7J2M6rO8IOqwmeuLpC4NCg0KYGBgDQpsaWJyYXJ5KGRwbHlyKQ0KbGlicmFyeSh0aWR5dGV4dCkNCmxpYnJhcnkodGV4dHJhbmspDQplaWN1dCA8LSBmdW5jdGlvbihvYmosZGljdGlvbmFyeT1OVUxMKSB7DQogIG5hbWVzKG9iaikgPSBjKCJJRCIsInRleHQiKQ0KICBpZihpcy5udWxsKGRpY3Rpb25hcnkpKSB7DQogICAgZGljaW9uYXJ5IDwtIG9iaiAlPiUgdW5uZXN0X3Rva2Vucyh3b3JkLHRleHQpDQogIH0NCiAgdGV4dHJhbmtfc2VudGVuY2VzKA0KICAgIGRhdGEgPSBvYmosDQogICAgdGVybWlub2xvZ3k9ZGljdGlvbmFyeQ0KICApDQp9DQpmaW5kX3RvcF9kb2MgPC0gZnVuY3Rpb24ob2JqLG49MTApIHsNCiAgYT1vYmpbWydzZW50ZW5jZXMnXV0gJT4lIA0KICAgIGFycmFuZ2UoZGVzYyh0ZXh0cmFuaykpICU+JSANCiAgICBzbGljZSgxOm4pDQogIGENCn0NCmBgYA0KDQoNCiMg7JiI7KCcDQoNCuyWtOumsOydtOyaqSBGaXRiaXQgdHJhY2tlcuydmCDrgrTsmqnsnYQg67aE7ISd7ZWY64qUICANCuyYiOygnOulvCDsnpHshLHtlZzri6QuDQoNCmFydGljbGVfc2VudGVuY2Vz66W8IOuLpOydjOqzvCDqsJnsnbQg7ZqN65Od7ZWc64ukLiANCmBgYHtyfQ0KbGlicmFyeShydmVzdCkNCnVybCA8LSAiaHR0cDovL3RpbWUuY29tLzUxOTY3NjEvZml0Yml0LWFjZS1raWRzLWZpdG5lc3MtdHJhY2tlci8iDQphcnRpY2xlIDwtIHJlYWRfaHRtbCh1cmwpICU+JSANCiAgaHRtbF9ub2RlcygnZGl2W2NsYXNzPSJwYWRkZWQiXScpICU+JSANCiAgaHRtbF90ZXh0KCkNCmFydGljbGVfc2VudGVuY2VzIDwtIHRpYmJsZSgNCiAgdGV4dD1hcnRpY2xlDQopICU+JSANCiAgdW5uZXN0X3Rva2VucyhzZW50ZW5jZSx0ZXh0LHRva2VuPSdzZW50ZW5jZXMnKSAlPiUgDQogIG11dGF0ZShzZW50ZW5jZV9pZD1yb3dfbnVtYmVyKCkpICU+JSANCiAgc2VsZWN0KHNlbnRlbmNlX2lkLHNlbnRlbmNlKQ0KYGBgDQoNCuuNsOydtO2EsOydmCDsnbTrpoTsnYQgSUQg7JmAIHRleHQg66GcIOyImOygle2VnOuLpC4gIA0K7Ja065akIOuqqOyKteyduOyngCDtmZXsnbjtlZjquLAg7JyE7ZW0IOyymOydjCDrjbDsnbTthLAg66qHIOqwnOulvCDstpTroKTrs7TsnpAuICANCg0KYGBge3J9DQpuYW1lcyhhcnRpY2xlX3NlbnRlbmNlcykgPC0gYygiSUQiLCJ0ZXh0IikNCmhlYWQoYXJ0aWNsZV9zZW50ZW5jZXMpDQpgYGANCuyCrOyghOydhCDsoJzsnpHtlZzri6QuICANCuyYgeyWtCBzdG9wX3dvcmRz66W8IGFudGlfam9pbigp7Jy866GcIOygnOqxsO2VnOuLpC4gIA0K66eM7JW9IO2PrO2VqOyLnO2CrCDri6jslrTrp4zsnYQg7LaU66as66Ck66m0IHNlbWlfam9pbigp7J2EIO2ZnOyaqe2VmOyXrCAgDQrtj6ztlajsi5ztgqwg64uo7Ja06rCAIOyeiOuKlCBkYXRhLmZyYW1l7J2EIOyngOygle2VtOyVvCDtlZzri6QuICANCuydtCBkYXRhLmZyYW1lIOqwneyytOuKlCB3b3Jk652864qUIOy7rOufvOunjOycvOuhnCDqtazshLHrkJjrqbAgIA0K7KSR67O16rCS7J2EIOqwgOyngOqzoCDsnojsp4Ag7JWK7JWE7JW8IO2VnOuLpC4NCg0KYGBge3J9DQphcnRpY2xlX3dvcmRzIDwtIGFydGljbGVfc2VudGVuY2VzICU+JSANCiAgdW5uZXN0X3Rva2Vucyh3b3JkLHRleHQpICU+JSANCiAgYW50aV9qb2luKHN0b3Bfd29yZHMsYnk9J3dvcmQnKQ0KYGBgDQoNCkVJQ1VU7J2EIOyLpO2Wie2VnOuLpC4NCg0KYGBge3J9DQphcnRpY2xlX3N1bW1hcnkgPC0gZWljdXQoYXJ0aWNsZV9zZW50ZW5jZXMsYXJ0aWNsZV93b3JkcykNCmBgYA0KDQrqsIDsnqUg7KCV67O065+J7J20IOunjuydgCAxMOqwnOydmCDsg4HsnIQg66y47ISc66W87LaU7Lac7ZWc64ukLg0KDQpgYGB7cn0NCnJlc3VsdCA8LSBmaW5kX3RvcF9kb2MoYXJ0aWNsZV9zdW1tYXJ5KQ0KcmVzdWx0DQpgYGANCuqysOqzvOulvCDrs7TrqbQg64uk7J2MIOy7rOufvOycvOuhnCDqtazshLHrkJjslrQg7J6I64ukLiANCg0KKiB0ZXh0cmFua19pZA0KKiBzZW50ZW5jZQ0KKiB0ZXh0cmFuaw0KDQrsnbTsoJwgcmVzdWx07J2YIOqysOqzvOyXkOyEnCDtgqTsm4zrk5zrpbwgdGlkeXRleHTroZwg64uk7IucIOuBhOynkeyWtOuCtOuztOyekC4gIA0KNO2ajCDsnbTsg4Eg65Ox7J6l7ZWcIO2CpOybjOuTnOuKlCDso7zsmpQg66y07JeH7J286rmMPw0KDQpgYGB7cn0NCnJlc3VsdCAlPiUgDQogIHNlbGVjdChzZW50ZW5jZSkgJT4lIA0KICB1bm5lc3RfdG9rZW5zKHdvcmQsc2VudGVuY2UpICU+JSANCiAgYW50aV9qb2luKHN0b3Bfd29yZHMpICU+JSANCiAgY291bnQod29yZCkgJT4lIA0KICBmaWx0ZXIobj49NCkgJT4lIA0KICBhcnJhbmdlKGRlc2MobikpIA0KYGBgDQoNCuydtOyXkCDrlLDrnbwg7ZiE7J6sIOusuOyEnOydmCDso7zsmpQg7Yag7ZS97J2AIGZpdGJpdCwgY2hpbGRyZW4sIHRyYWNrZXIsIGFjZeudvOuKlCDri6jslrTroZwgIA0K6riw7Iig65CgIOyImCDsnojsnYzsnYQg7JWMIOyImCDsnojri6QuDQo=