I have chosen to use medical research papers to build a document classifier, and I will be using the National Institutes of Health (NIH)’s PubMed Central (PMC) open access author manuscript database for this purpose.
library(tm)
library(textTinyR)
library(stringr)
library(e1071)
library(caret)
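The directory variables used below (path, path_editorials, path_originals) are assumed to have been set up beforehand; as a sketch, they might look like this (the directory names are placeholders, not my actual locations):
# Hypothetical local directories; point these at your own copies of the PMC files
path <- "pmc_manuscripts"                        # all plain-text manuscripts
path_editorials <- "pmc_manuscripts/editorials"  # hand-labeled editorials/reviews
path_originals <- "pmc_manuscripts/originals"    # hand-labeled original research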
# using base R to assemble the file list.
# List all text files in the specified directory
file_names <- list.files(path = path, pattern = "\\.txt$", full.names = TRUE)
# listing the files in each class directory to serve as ground truths
editorial_files <- list.files(path = path_editorials, full.names = TRUE)
original_files <- list.files(path = path_originals, full.names = TRUE)
# Create vectors of labels corresponding to the files
labels_editorials <- rep("editorial/review", length(editorial_files))
labels_originals <- rep("original research", length(original_files))
# Combine file names and labels
file_names <- c(editorial_files, original_files)
labels <- c(labels_editorials, labels_originals)
# Create a data frame for easier handling
data_files <- data.frame(file_name = file_names, label = labels, stringsAsFactors = FALSE)
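At this point it is worth tallying the labels, because (as the interpretation below shows) the class balance drives everything downstream; a quick check I would add here:
# Sanity check on the class balance before modeling
table(data_files$label)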
Before turning the raw text into a corpus, there are a few things that I want to do. The PMC files carry a boilerplate license notice that says nothing about the article type, so I strip it out while reading them:
# using the stringr package.
# Sentence to remove
sentence_to_remove <- "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law."
sentence_to_remove2 <- "LICENSE: This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law."
# Read files and remove the license boilerplate
text_data <- lapply(file_names, function(file) {
  text <- paste(readLines(con = file, warn = FALSE), collapse = " ")
  # remove the longer, prefixed variant first so no "LICENSE:" residue is left;
  # fixed() makes str_replace treat the sentences as literal strings, not regexes
  text <- str_replace(text, fixed(sentence_to_remove2), "")
  text <- str_replace(text, fixed(sentence_to_remove), "")
  return(text)
})
# a corpus, built from the cleaned text (tm package)
corpus <- Corpus(VectorSource(unlist(text_data)))
# Text pre-processing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
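To verify that the cleaning and pre-processing behaved as intended, it helps to peek at one processed document (a sanity check on my part, not a required step):
# Print the first 200 characters of the first processed document
writeLines(substr(as.character(corpus[[1]]), 1, 200))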
Next, I create the document-term matrix (DTM), which records the frequency of each term in every document of the collection.
# relying on the tm package.
dtm <- DocumentTermMatrix(corpus)
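A DTM built from full-text papers is extremely sparse; an optional refinement (my suggestion, with an arbitrary starting threshold) is to drop the rarest terms before weighting:
# Optional: drop terms absent from 99% of documents (threshold is illustrative)
dtm_small <- removeSparseTerms(dtm, sparse = 0.99)
dim(dtm_small)
If adopted, dtm_small would replace dtm in the weightTfIdf() call below.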
To evaluate how important a word is to a document within the collection, I apply TF-IDF (term frequency-inverse document frequency) weighting.
# relying on the tm package.
tfidf <- weightTfIdf(dtm)
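For intuition: with its default normalization, tm's weightTfIdf scores a term in a document by its relative frequency in that document times log2(number of documents / number of documents containing the term). A tiny hand computation with made-up numbers:
# Illustrative only: a term making up 2% of one document, found in 5 of 200 documents
tf <- 0.02
idf <- log2(200 / 5)
tf * idf # ~0.106, the weight that would land in that cell of the matrix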
# alternative: the textTinyR package can also produce TF-IDF weights
# (for example, its sparse_term_matrix class can build a tf-idf-weighted
# term matrix directly from the raw text) and scales better on large corpora
set.seed(123) # for reproducibility
# convert the sparse DTM to a dense matrix so that e1071::svm can accept it
tfidf_matrix <- as.matrix(tfidf)
# training set (70% of the data)
train_indices <- sample(1:nrow(tfidf_matrix), floor(0.7 * nrow(tfidf_matrix)))
# training and testing sets for TF-IDF data
train_tfidf <- tfidf_matrix[train_indices, ]
test_tfidf <- tfidf_matrix[-train_indices, ]
# training and testing sets for labels
train_labels <- labels[train_indices]
test_labels <- labels[-train_indices]
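One caveat: with data this imbalanced, a plain random split can leave the rare class under-represented in either set. caret's createDataPartition draws a stratified split instead; a sketch of that alternative (commented out, since it is not what produced the results below):
# Stratified alternative: preserves the class proportions in both sets
# strat_indices <- createDataPartition(as.factor(labels), p = 0.7, list = FALSE)
# train_tfidf <- tfidf_matrix[strat_indices, ]
# test_tfidf <- tfidf_matrix[-strat_indices, ]
# train_labels <- labels[strat_indices]
# test_labels <- labels[-strat_indices]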
# Train the SVM model
model <- svm(train_tfidf, as.factor(train_labels))
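The SVM can also be nudged toward the minority class through e1071::svm's class.weights argument; the weights here are illustrative, not tuned (commented out, as it is not the model evaluated below):
# Alternative: penalize mistakes on the rare class more heavily
# model <- svm(train_tfidf, as.factor(train_labels),
#              class.weights = c("editorial/review" = 5, "original research" = 1))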
# predict on the test set
predictions <- predict(model, test_tfidf)
# evaluation using a confusion matrix
conf_matrix <- confusionMatrix(as.factor(predictions), as.factor(test_labels))
# confusion matrix along with stats
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction editorial/review original research
## editorial/review 0 0
## original research 10 73
##
## Accuracy : 0.8795
## 95% CI : (0.7896, 0.9407)
## No Information Rate : 0.8795
## P-Value [Acc > NIR] : 0.583212
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8795
## Prevalence : 0.1205
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : editorial/review
##
**INTERPRETATION.**
The numbers deserve a careful read: the model predicted "original research" for every single document. The accuracy (0.8795) is exactly the no-information rate, Kappa is 0, and sensitivity for the "editorial/review" class is 0, so the classifier does no better than always guessing the majority class. It is also important to note that I built the labels myself, one paper at a time, which took a LONG time on my end to finish, and the resulting dataset is still small. This emphasizes how time-intensive labeling is to do well and at scale. I need a larger training set offering more examples of both classes, especially the minority "editorial/review" class.
An Original Paper: [example manuscript not reproduced here]
An Editorial/Review Paper: [example manuscript not reproduced here]