A high IDF means that the term appears mainly in document C and occurs in
few other documents, or in none at all.
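As a reference for the statement above, the definition can be written out as follows. This is a sketch that assumes the natural-log form of IDF, which is consistent with the values printed by `bind_tf_idf()` further down. The symbols $N$, $n_t$, $f_{t,d}$, and $t'$ are notation introduced here for illustration.

```latex
% N       : total number of documents in the corpus
% n_t     : number of documents that contain term t
% f_{t,d} : number of times term t occurs in document d
\[
  \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
  \mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
  \text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```

Under this definition a term found in every document gets $\mathrm{idf} = \ln 1 = 0$, and the value grows as $n_t$ shrinks, so a term confined to a single document gets the largest IDF in the corpus.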
install.packages("tidyverse")
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpyOagxz\downloaded_packages
install.packages("tidytext")
## package 'tidytext' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpyOagxz\downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
# Example data
text_df <- tibble(
  document = c("A", "B", "C"),
  text = c("data science is fun",
           "data model data",
           "model and science"))
# Tokenize and remove stop words
data("stop_words")
tidy_tokens <- text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
# Compute TF-IDF
tfidf_result <- tidy_tokens %>%
  count(document, word, sort = TRUE) %>%
  bind_tf_idf(word, document, n)
tfidf_result
## # A tibble: 7 × 6
##   document word        n    tf   idf tf_idf
##   <chr>    <chr>   <int> <dbl> <dbl>  <dbl>
## 1 B        data        2 0.667 0.405  0.270
## 2 A        data        1 0.333 0.405  0.135
## 3 A        fun         1 0.333 1.10   0.366
## 4 A        science     1 0.333 0.405  0.135
## 5 B        model       1 0.333 0.405  0.135
## 6 C        model       1 0.5   0.405  0.203
## 7 C        science     1 0.5   0.405  0.203
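The `idf` and `tf_idf` columns above can be checked by hand. The following is a minimal base-R sketch, not part of the original example. The token lists in `docs` are copied from the table after stop-word removal, and the names `docs`, `doc_freq`, `tf_data_B`, and `tfidf_data_B` are chosen here for illustration.

```r
# Hand check of the idf and tf_idf columns, using base R only.
# Token lists are copied from the example after stop-word removal.
docs <- list(
  A = c("data", "science", "fun"),
  B = c("data", "model", "data"),
  C = c("model", "science")
)

n_docs <- length(docs)

# Document frequency: number of documents that contain each distinct word
vocab <- sort(unique(unlist(docs)))
doc_freq <- sapply(vocab, function(w) {
  sum(sapply(docs, function(d) w %in% d))
})

# idf as the natural log of (number of documents / document frequency)
idf <- log(n_docs / doc_freq)

# Term frequency of "data" in document B, then its tf-idf
tf_data_B <- sum(docs$B == "data") / length(docs$B)
tfidf_data_B <- tf_data_B * idf[["data"]]

print(round(idf, 3))
print(round(tfidf_data_B, 3))
```

The word "fun" occurs only in document A, so its IDF is log(3/1), about 1.10, while "data", "model", and "science" each occur in two of the three documents and share log(3/2), about 0.405. This agrees with the `idf` column in the tibble.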