I have chosen to use medical research papers to build a document classifier, and I will be using the National Institutes of Health (NIH)’s PubMed Central (PMC) open access author manuscript database for this purpose.

Loading the packages.

library(tm)
library(textTinyR)
library(stringr)
library(e1071)
library(caret)

Loading the documents.

# using the tm package.

# List the text files for each class; path_editorials and path_originals are
# the directories holding the editorial/review and original-research manuscripts,
# and the directory a file comes from serves as its ground-truth label
editorial_files <- list.files(path = path_editorials, pattern = "\\.txt$", full.names = TRUE)
original_files <- list.files(path = path_originals, pattern = "\\.txt$", full.names = TRUE)

# Create vectors of labels corresponding to the files
labels_editorials <- rep("editorial/review", length(editorial_files))
labels_originals <- rep("original research", length(original_files))

# Combine file names and labels
file_names <- c(editorial_files, original_files)
labels <- c(labels_editorials, labels_originals)

# Create a data frame for easier handling
data_files <- data.frame(file_name = file_names, label = labels, stringsAsFactors = FALSE)


# Read each file and store the contents in a list
text_data <- lapply(file_names, function(file) {
  paste(readLines(con = file, warn = FALSE), collapse = " ")
})

# create a corpus from the raw text
corpus <- Corpus(VectorSource(text_data))
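Before any cleaning, a quick sanity check (assuming the directory paths above point at the right folders) confirms everything loaded and shows the class distribution:

# Quick sanity check of what was loaded
length(corpus)            # number of documents in the corpus
table(data_files$label)   # how many files carry each ground-truth label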

Data Pre-Processing.

Here, there are a few things I want to do:

  • Transform all text to lowercase for normalization.
  • Each text file contains this sentence twice: “This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law.” I want to remove it from all of the text files.
  • Remove punctuation, stopwords, and extra whitespace.
  • Apply stemming to reduce words to their base/root form.

# using the stringr package.

# Sentence to remove
sentence_to_remove <- "This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law."
sentence_to_remove2 <- "LICENSE: This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law."


# Read the files again and remove the licensing sentence wherever it appears
# (str_replace_all catches both occurrences; fixed() treats the sentence as
# literal text rather than a regular expression; the longer "LICENSE:" variant
# is removed first so no prefix is left behind)
text_data <- lapply(file_names, function(file) {
  text <- paste(readLines(con = file, warn = FALSE), collapse = " ")
  text <- str_replace_all(text, fixed(sentence_to_remove2), "")
  text <- str_replace_all(text, fixed(sentence_to_remove), "")
  return(text)
})

# Rebuild the corpus from the cleaned text so the removal takes effect
corpus <- Corpus(VectorSource(text_data))

# Text pre-processing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
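To verify the transformations did what I expected, I can peek at part of a processed document (the 300-character cutoff is arbitrary):

# First 300 characters of the first pre-processed document
substr(as.character(corpus[[1]]), 1, 300)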

Feature Extraction.

Creating the document-term matrix (DTM), which records how frequently each term occurs in each document of the collection.

# relying on the tm package.

dtm <- DocumentTermMatrix(corpus)
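A quick look at the dimensions helps gauge how sparse the matrix is; optionally, very rare terms can be dropped with removeSparseTerms to shrink it. The 0.99 threshold and the dtm_small name below are illustrative, not part of the pipeline above:

dim(dtm)                                   # documents x unique terms
dtm_small <- removeSparseTerms(dtm, 0.99)  # drop terms missing from more than 99% of documents
dim(dtm_small)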

TF-IDF (Term Frequency-Inverse Document Frequency) Weighting.

TF-IDF weighting evaluates how important a word is to a document within the collection: terms that appear often in a document but rarely across the corpus receive higher weights.

# relying on the tm package.
tfidf <- weightTfIdf(dtm)

# alternative: compute the TF-IDF weights with the textTinyR package
# library(textTinyR)
# Convert DTM to a dense matrix
# dtm_matrix <- as.matrix(dtm)
# tfidf_matrix <- textTinyR::TF_IDF(dtm = dtm_matrix, document_column = 1)
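For intuition, the weighting can also be reproduced by hand on a dense copy of the matrix: each count is divided by its document’s total term count (term frequency) and multiplied by the log of how rare the term is across documents. A minimal sketch, assuming tm::weightTfIdf’s defaults (normalized tf times log2(N/df)); dtm_dense, tf_manual, idf_manual, and tfidf_manual are names I am introducing here for illustration only:

dtm_dense  <- as.matrix(dtm)                          # dense copy; fine for a corpus this size
tf_manual  <- dtm_dense / rowSums(dtm_dense)          # term frequency, normalized per document
idf_manual <- log2(nrow(dtm_dense) / colSums(dtm_dense > 0))  # inverse document frequency
tfidf_manual <- sweep(tf_manual, 2, idf_manual, `*`)  # should roughly match weightTfIdf(dtm)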

Machine Learning for Classification (using an SVM model).

set.seed(123) # for reproducibility

# indices for the training set (70% of the documents)
train_indices <- sample(seq_len(nrow(tfidf)), floor(0.7 * nrow(tfidf)))

# training and testing sets for TF-IDF data
train_tfidf <- tfidf[train_indices, ]
test_tfidf <- tfidf[-train_indices, ]

# training and testing sets for labels
train_labels <- labels[train_indices]
test_labels <- labels[-train_indices]

# Train the SVM model
model <- svm(train_tfidf, as.factor(train_labels))
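One thing worth noting: with classes this imbalanced, a plain random split can leave very few editorials/reviews in the training set. A stratified split, e.g. with caret::createDataPartition (which samples within each class), might be a safer alternative. A minimal sketch, assuming the tfidf and labels objects from above; the *_strat names are new and only used here:

# Stratified 70/30 split that keeps the class proportions similar in both sets
set.seed(123)
train_idx <- createDataPartition(as.factor(labels), p = 0.7, list = FALSE)[, 1]

train_tfidf_strat  <- tfidf[train_idx, ]
test_tfidf_strat   <- tfidf[-train_idx, ]
train_labels_strat <- labels[train_idx]
test_labels_strat  <- labels[-train_idx]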

Model Evaluation.

# library(caret)
# predict on the test set
predictions <- predict(model, test_tfidf)

# evaluation using a confusion matrix
conf_matrix <- confusionMatrix(as.factor(predictions), as.factor(test_labels))

# confusion matrix along with stats
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          editorial/review original research
##   editorial/review                 0                 0
##   original research               10                73
##                                           
##                Accuracy : 0.8795          
##                  95% CI : (0.7896, 0.9407)
##     No Information Rate : 0.8795          
##     P-Value [Acc > NIR] : 0.583212        
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.004427        
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.8795          
##              Prevalence : 0.1205          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : editorial/review
## 

**INTERPRETATION.**

  • There were no instances where the model correctly predicted “editorial/review” (hence the 0 in the top-left corner). The model failed to identify the 10 documents that were actually “editorial/review”, classifying them instead as “original research” (0 true positives, 10 false negatives).
  • There were no instances where “original research” papers were incorrectly classified as “editorial/review” (hence the 0 in the top-right corner), and the model correctly identified all 73 “original research” documents (73 true negatives, 0 false positives).
  • An accuracy of 87.95% looks like a high overall rate of correct predictions, but I think only because the classes are imbalanced (30 vs. 243); the arithmetic is recomputed from the matrix cells in the short sketch after this list.
  • No Information Rate (NIR, 0.8795): the accuracy that could be achieved by always predicting the most frequent class. Since it equals the model’s accuracy, the model does not really do better than baseline predictions. P-Value [Acc > NIR] (0.583212): this tests whether the model’s accuracy is significantly higher than the NIR; the large p-value suggests the model is not performing better than trivial predictions.
  • Kappa (0): Cohen’s Kappa measures agreement for categorical items while accounting for the agreement expected by chance. A value of 0 indicates that the agreement is no better than chance.
  • Mcnemar’s Test P-Value (0.004427): This test checks the symmetry of the confusion matrix, essentially testing whether the FN and FP are significantly different. A value less than 0.05 indicates a significant difference, often pointing to a bias in misclassification one way or another.
  • Sensitivity (0.0000): also known as recall or the true positive rate, this is the proportion of actual positives that are correctly identified. A value of 0 means the model failed to correctly identify any “editorial/review” papers.
  • Specificity (1.0000): Measures the proportion of actual negatives that are correctly identified. A value of 1 is excellent, indicating no “original research” papers were incorrectly labeled as “editorial/review”.
  • Positive Predictive Value (NaN) and Negative Predictive Value (0.8795): PPV is undefined (NaN) here due to zero predicted positives (division by zero issue), and NPV is high because the model is good at predicting the negative class (“original research”).
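To make the arithmetic behind these numbers concrete, the headline metrics can be recomputed directly from the four cells of the confusion matrix, with “editorial/review” as the positive class:

# Recomputing the headline metrics from the confusion matrix cells
TP <- 0; FP <- 0; FN <- 10; TN <- 73

(TP + TN) / (TP + FP + FN + TN)   # accuracy    = 73/83 ≈ 0.8795
TP / (TP + FN)                    # sensitivity = 0/10  = 0
TN / (TN + FP)                    # specificity = 73/73 = 1
TP / (TP + FP)                    # PPV         = 0/0   -> NaN
TN / (TN + FN)                    # NPV         = 73/83 ≈ 0.8795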

It is important to note here that I built the labels myself, one paper at a time, which took a LONG time on my end to finish, and the number of labeled documents is still small. This emphasizes how time-intensive labeling is to do well and at scale. I need a larger training set with more examples of both classes (especially the “editorials/reviews”); until then, weighting the minority class more heavily during training is one possible stopgap, sketched below.
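A minimal sketch of that stopgap, reusing the train_tfidf and train_labels objects from above: the class.weights argument of e1071::svm penalizes errors on the rarer class more heavily. The inverse-frequency weights and the model_weighted name are illustrative, not tuned:

# Hypothetical re-training with class weights inversely proportional to class frequency
tab <- table(train_labels)
wts <- setNames(as.numeric(max(tab) / tab), names(tab))   # rarer class gets the larger weight
model_weighted <- svm(train_tfidf, as.factor(train_labels), class.weights = wts)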

.

.

.

An example research paper for each category:

An Original Paper:

Original Research Example

An Editorial/Review Paper:

Editorial/Review Example