Text Mining
Simple topic modelling with R
In this document I’ll step through creating a topic prediction model using the ‘tm’ package for text mining.
It uses Usenet data for both training and testing, and fits a logistic regression model (via the ‘glmnet’ package) to classify text files into one of two categories.
1. Loading the data
Now let’s load the raw training data for our two topics, baseball and space, and convert each to a corpus object:
library(tm)      # Corpus, DirSource, tm_map, DocumentTermMatrix
library(glmnet)  # glmnet, used for the logistic regression later on
topic1_docs <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-train/rec.sport.baseball',
    encoding = 'UTF-8'
  )
)
topic2_docs <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-train/sci.space',
    encoding = 'UTF-8'
  )
)
Now we need to label each of these docs with the correct topic, so we have something to test our model against. We do this by combining both corpora into a single list, creating a separate vector of equal length holding a label for each document, then converting these labels to ‘1’ (baseball) or ‘0’ (space) so they can later be compared directly with our model output:
classifier_docs <- c(as.list(topic1_docs), as.list(topic2_docs))
topic1_label <- replicate(length(topic1_docs), 'baseball')
topic2_label <- replicate(length(topic2_docs), 'space')
classifier_label <- c(topic1_label, topic2_label)
classifier_label <- (classifier_label == 'baseball') * 1
Next we convert the combined list back into a single Corpus, and do the usual pre-processing, i.e. making lowercase -> removing punctuation -> removing stopwords -> stemming:
classifier_docs <- Corpus(VectorSource(classifier_docs))
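classifier_docs_cleaned <- classifier_docs |>
  tm_map(content_transformer(tolower)) |>  # content_transformer() makes tolower safe for any corpus type
  tm_map(removePunctuation) |>
  tm_map(removeWords, stopwords('en')) |>
  tm_map(stemDocument)  # stemming needs the SnowballC package installed
Now we do the exact same thing for the test data: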
topic1_docs_test <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-test/rec.sport.baseball',
    encoding = 'UTF-8'
  )
)
topic2_docs_test <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-test/sci.space',
    encoding = 'UTF-8'
  )
)
classifier_docs_test <- c(as.list(topic1_docs_test), as.list(topic2_docs_test))
topic1_label_test <- replicate(length(topic1_docs_test), 'baseball')
topic2_label_test <- replicate(length(topic2_docs_test), 'space')
classifier_label_test <- c(topic1_label_test, topic2_label_test)
classifier_label_test <- (classifier_label_test == 'baseball') * 1
classifier_docs_test <- Corpus(VectorSource(classifier_docs_test))
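classifier_docs_test_cleaned <- classifier_docs_test |>
  tm_map(content_transformer(tolower)) |>  # content_transformer() makes tolower safe for any corpus type
  tm_map(removePunctuation) |>
  tm_map(removeWords, stopwords('en')) |>
  tm_map(stemDocument)  # stemming needs the SnowballC package installed
2. Creating the model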
The model involves four steps:
- Extract the vocabulary from the text files
- Use this vocab to create a training DTM from the training data
- Build a logistic regression model using the training data
- Create another DTM from the test data to test the model’s accuracy
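The steps listed above can be sketched end to end on a toy corpus. This is only a minimal sketch under stated assumptions, not the pipeline used in this walkthrough: the documents, the labels and the `clean()` helper are hypothetical, and it uses raw term counts with no stemming to keep the example small.

```r
library(tm)      # corpus handling and document-term matrices
library(glmnet)  # regularised logistic regression

# Hypothetical toy documents standing in for the Usenet files
train_text <- c(
  "the pitcher threw a fast ball", "home run in the ninth inning",
  "the batter hit the ball hard", "the team won the baseball game",
  "the rocket reached low orbit", "the shuttle docked with the station",
  "a satellite launch into orbit", "astronauts left the space station"
)
train_label <- c(1, 1, 1, 1, 0, 0, 0, 0)  # 1 = baseball, 0 = space
test_text <- c("a long ball game", "orbit of the station")

# Hypothetical helper: lowercase, strip punctuation, drop stopwords
clean <- function(txt) {
  corpus <- VCorpus(VectorSource(txt))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, removeWords, stopwords("en"))
}

# Step 1: extract the vocabulary from the training documents
vocab <- Terms(DocumentTermMatrix(clean(train_text)))

# Step 2: training DTM restricted to that vocabulary
dtm_train <- DocumentTermMatrix(
  clean(train_text),
  control = list(dictionary = vocab)
)

# Step 3: fit the logistic regression on the training matrix
model <- glmnet(as.matrix(dtm_train), train_label, family = "binomial")

# Step 4: test DTM with the same vocabulary, then predict probabilities
dtm_test <- DocumentTermMatrix(
  clean(test_text),
  control = list(dictionary = vocab)
)
probs <- predict(
  model,
  as.matrix(dtm_test),
  s = tail(model$lambda, 1),  # last (smallest) lambda on the fitted path
  type = "response"
)
```

Because both matrices are built from the same `dictionary`, their columns line up, which `predict()` relies on. With classes this small, glmnet is likely to warn about the number of observations; that is expected for a toy example.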
Step 1: let’s extract the vocab by creating a ‘master’ DTM from the whole training corpus, removing sparse terms and then extracting the remaining terms into a character vector:
classifier_dtm <- DocumentTermMatrix(
  classifier_docs_cleaned
)
# Drop terms that are missing from more than 99% of the documents
classifier_dtm <- removeSparseTerms(classifier_dtm, 0.99)
# Terms() returns only the term names; unlisting dimnames would also pull in the document IDs
source_vocab <- Terms(classifier_dtm)
Step 2: now we make a training DTM using the source vocab extracted in step one as the ‘dictionary’ argument, using TfIdf weighting (a common choice for text features):
classifier_dtm_train <- DocumentTermMatrix(
  classifier_docs_cleaned,
  control = list(
    dictionary = source_vocab,
    weighting = weightTfIdf
  )
)
Step 3: now we build the actual logistic regression model using the training DTM as the input and the labels we manually assigned as the response:
classifier_model <- glmnet(
  classifier_dtm_train,
  classifier_label,
  family = 'binomial'
)
Step 4: create the ‘test’ DTM to run the model against, convert it to a matrix and run the model:
classifier_dtm_test <- DocumentTermMatrix(
  classifier_docs_test_cleaned,
  control = list(
    dictionary = source_vocab,
    weighting = weightTfIdf
  )
)
classifier_dtm_test <- data.matrix(classifier_dtm_test)
classifier_probabilities <- predict(
  classifier_model,
  classifier_dtm_test,
  # s selects the last (smallest) lambda on the fitted path, i.e. the least regularised fit
  s = tail(classifier_model$lambda, 1),
  type = 'response'
)
3. Model Output
Now we need to see how good our model is:
classifier_binary <- ifelse(
  classifier_probabilities > 0.5,
  1,
  0
)
classification_error <- mean(
  classifier_binary != classifier_label_test
)
classifier_accuracy <- 1 - classification_error
So our ultimate accuracy is 0.5929204. That’s not very good!
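A single accuracy figure hides which class the model gets wrong, so a confusion matrix and a majority-class baseline can help put it in context. The sketch below uses made-up vectors, not the results of the model above; in this walkthrough the inputs would presumably be `classifier_label_test` and `as.vector(classifier_binary)`.

```r
# Hypothetical labels and predictions, NOT the output of the model above
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)  # 1 = baseball, 0 = space
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0)

# Rows are the true classes, columns are the predicted classes
confusion <- table(actual = actual, predicted = predicted)

# Accuracy is the share of documents on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Baseline: accuracy of always predicting the most common class
baseline <- max(table(actual)) / length(actual)

print(confusion)
cat("accuracy:", accuracy, " baseline:", baseline, "\n")
```

If the accuracy is close to the baseline, the model is barely doing better than always guessing the larger class.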