Text Mining

Simple topic modelling with R

In this document I’ll step through creating a topic prediction model using the ‘tm’ package for text mining.

It uses Usenet posts (the 20news data set) for both training and testing, and fits a logistic regression model with the ‘glmnet’ package to classify each text file into one of two categories.

1. Loading the data

Let’s load the raw training data for our two topics, baseball and space, and convert each to a corpus object:

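library(tm)         # corpus handling and document-term matrices
library(SnowballC)  # stemming backend used by stemDocument()
library(glmnet)     # regularised logistic regression

topic1_docs <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-train/rec.sport.baseball', encoding = 'UTF-8'
  )
)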

topic2_docs <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-train/sci.space', encoding = 'UTF-8'
  )
)

Now we need to label each of these docs with the correct topic, so the model has a known answer to learn from. We do this by combining both corpora into a single list, creating a vector of the same length that holds the topic label for each document, then converting these labels to ‘1’ (baseball) or ‘0’ (space) so they can later be compared directly with the model output:

classifier_docs <- c(as.list(topic1_docs), as.list(topic2_docs))

topic1_label <- replicate(length(topic1_docs), 'baseball')
topic2_label <- replicate(length(topic2_docs), 'space')
classifier_label <- c(topic1_label, topic2_label)

classifier_label <- (classifier_label == 'baseball') * 1
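
As a quick check of the labelling step, here is a minimal sketch on made-up document counts (three baseball and two space documents, chosen only for illustration). It uses base R’s rep(), which does the same job as replicate() when the value is constant; the toy_ names are hypothetical.

```r
# Hypothetical counts, standing in for length(topic1_docs) and length(topic2_docs)
n_baseball <- 3
n_space <- 2

toy_label <- c(rep('baseball', n_baseball), rep('space', n_space))

# TRUE/FALSE becomes 1/0, so baseball is the positive class
toy_label_binary <- as.integer(toy_label == 'baseball')
print(toy_label_binary)
```

The result is three 1s followed by two 0s, in the same order as the documents in the combined list.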

Next we convert the combined list back into a single Corpus and apply the usual pre-processing steps in order (lowercasing, removing punctuation, removing stopwords, stemming):

classifier_docs <- Corpus(VectorSource(classifier_docs))
classifier_docs_cleaned <- classifier_docs |>
  tm_map(content_transformer(tolower)) |>
  tm_map(removePunctuation) |>
  tm_map(removeWords, stopwords('en')) |>
  tm_map(stemDocument)
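
To see what these steps do to the text, here is a small sketch on two made-up sentences. It assumes the ‘tm’ and ‘SnowballC’ packages are installed, builds the corpus with VCorpus so that it does not depend on the newsgroup files, and wraps tolower in content_transformer(), which is how tm expects a plain base R function to be passed to tm_map. The toy_ names are hypothetical.

```r
library(tm)         # text mining framework
library(SnowballC)  # stemmer used by stemDocument()

# Two made-up sentences, for illustration only
toy_corpus <- VCorpus(VectorSource(c(
  'The pitcher was throwing fastballs!',
  'Rockets are launching satellites into orbit.'
)))

toy_cleaned <- toy_corpus |>
  tm_map(content_transformer(tolower)) |>  # keep documents as documents
  tm_map(removePunctuation) |>
  tm_map(removeWords, stopwords('en')) |>
  tm_map(stemDocument)

# Pull the cleaned text back out as a character vector
toy_text <- sapply(toy_cleaned, as.character)
print(toy_text)
```

Printing the text before and after each step is a cheap way to confirm the pipeline does what you expect before running it on thousands of posts.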

Now we do the exact same thing for the test data:

topic1_docs_test <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-test/rec.sport.baseball', encoding = 'UTF-8'
  )
)

topic2_docs_test <- Corpus(
  DirSource(
    '~/RStuff/Intro to DS practicals/20news-data/20news-test/sci.space', encoding = 'UTF-8'
  )
)

classifier_docs_test <- c(as.list(topic1_docs_test), as.list(topic2_docs_test))

topic1_label_test <- replicate(length(topic1_docs_test), 'baseball')
topic2_label_test <- replicate(length(topic2_docs_test), 'space')
classifier_label_test <- c(topic1_label_test, topic2_label_test)

classifier_label_test <- (classifier_label_test == 'baseball') * 1

classifier_docs_test <- Corpus(VectorSource(classifier_docs_test))
classifier_docs_test_cleaned <- classifier_docs_test |>
  tm_map(content_transformer(tolower)) |>
  tm_map(removePunctuation) |>
  tm_map(removeWords, stopwords('en')) |>
  tm_map(stemDocument)
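
Since the test-data code repeats the training-data code line for line, one option is to wrap the steps in a function. The sketch below is one possible arrangement, not part of the original workflow: prepare_topic_corpus() and its arguments are hypothetical names, and it takes character vectors of raw text rather than directories so that it runs without the newsgroup files.

```r
library(tm)
library(SnowballC)

# Hypothetical helper: builds a cleaned corpus plus 1/0 labels from two
# character vectors of raw documents (positive class first).
prepare_topic_corpus <- function(positive_texts, negative_texts) {
  docs <- VCorpus(VectorSource(c(positive_texts, negative_texts)))
  cleaned <- docs |>
    tm_map(content_transformer(tolower)) |>
    tm_map(removePunctuation) |>
    tm_map(removeWords, stopwords('en')) |>
    tm_map(stemDocument)
  labels <- c(rep(1, length(positive_texts)), rep(0, length(negative_texts)))
  list(docs = cleaned, labels = labels)
}

# Made-up example documents
toy_train <- prepare_topic_corpus(
  positive_texts = c('He hit a home run.', 'The pitcher struck out.'),
  negative_texts = c('The shuttle reached orbit.')
)
print(length(toy_train$docs))
print(toy_train$labels)
```

Calling the same function for the training and test sets guarantees both go through identical pre-processing, which matters because the model only sees the cleaned terms.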

2. Creating the model

The model involves four steps:

  1. Extract the vocabulary from the text files
  2. Use this vocab to create a training DTM from the training data
  3. Build a logistic regression model using the training data
  4. Create another DTM from the test data to test the model’s accuracy

Step 1: let’s extract the vocab by creating a ‘master’ DTM from the whole training corpus, removing sparse terms (those absent from more than 99% of documents), and then extracting the terms into a character vector:

classifier_dtm <- DocumentTermMatrix(
  classifier_docs_cleaned
)
classifier_dtm <- removeSparseTerms(classifier_dtm, 0.99)
# Terms() returns only the term names; dimnames also contains the document IDs
source_vocab <- Terms(classifier_dtm)
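
A DTM’s dimnames hold two things, the document IDs and the terms, and only the terms belong in a dictionary. The sketch below shows the difference on a three-document toy corpus; it assumes ‘tm’ is installed, and the toy_ names are hypothetical.

```r
library(tm)

# Three made-up documents
toy_corpus <- VCorpus(VectorSource(c(
  'orbit launch orbit',
  'pitch inning pitch',
  'launch rocket'
)))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# dimnames has two components: Docs and Terms
str(toy_dtm$dimnames)

# Terms() returns just the vocabulary
toy_vocab <- Terms(toy_dtm)
print(toy_vocab)
```

Here unlist(toy_dtm$dimnames) would give eight entries (three document IDs plus five terms), while Terms(toy_dtm) gives only the five terms.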

Step 2: now we make a training DTM, passing the vocab extracted in step 1 as the ‘dictionary’ argument and using TF-IDF weighting (a standard choice):

classifier_dtm_train <- DocumentTermMatrix(
  classifier_docs_cleaned,
  control = list(
    dictionary = source_vocab,
    weighting = weightTfIdf
  )
)
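
To make the weighting concrete, here is a hand-rolled TF-IDF calculation in base R on a made-up count matrix. It follows the usual definition (term frequency normalised by document length, multiplied by log2 of the inverse document frequency), which is my understanding of what weightTfIdf computes with its default normalisation; treat it as an illustration rather than a reimplementation of the package.

```r
# Toy document-term counts: rows are documents, columns are terms
counts <- matrix(
  c(2, 0, 1,
    0, 3, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c('doc1', 'doc2'), c('pitch', 'orbit', 'game'))
)

tf <- counts / rowSums(counts)     # share of each document taken up by each term
df <- colSums(counts > 0)          # number of documents containing each term
idf <- log2(nrow(counts) / df)     # rarer terms get a larger weight
tfidf <- sweep(tf, 2, idf, '*')    # scale each term column by its idf
print(tfidf)
```

Note that ‘game’ appears in both documents, so its idf is log2(2 / 2) = 0 and it drops out entirely, while ‘pitch’ and ‘orbit’ keep their term frequencies (2/3 and 3/4). That is the point of the weighting: terms that appear everywhere tell the model nothing about the topic.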

Step 3: now we build the actual logistic regression model, using the training DTM as the input and the labels we assigned as the response:

classifier_model <- glmnet(
  as.matrix(classifier_dtm_train),  # glmnet expects a numeric matrix
  classifier_label,
  family = 'binomial'
)

Step 4: create the ‘test’ DTM using the same vocab, convert it to a matrix and run the model against it:

classifier_dtm_test <- DocumentTermMatrix(
  classifier_docs_test_cleaned,
  control = list(
    dictionary = source_vocab,
    weighting = weightTfIdf
  )
)

classifier_dtm_test <- data.matrix(classifier_dtm_test)

classifier_probabilities <- predict(
  classifier_model,
  classifier_dtm_test,
  s = tail(classifier_model$lambda, 1),  # smallest lambda on the fitted path
  type = 'response'
)
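
The call above predicts at the last value on the lambda path, which is the smallest penalty glmnet tried and therefore the least regularised fit. A common alternative, sketched here on simulated data rather than the newsgroup DTM, is to let cv.glmnet pick the penalty by cross-validation. This is a swapped-in technique, not what the walkthrough does, and all the toy_ names are assumptions.

```r
library(glmnet)

set.seed(42)
# Simulated stand-in for a DTM: 200 documents, 20 term columns
n <- 200
p <- 20
toy_x <- matrix(rnorm(n * p), nrow = n, ncol = p)
# Only the first two columns carry signal about the label
toy_y <- rbinom(n, 1, plogis(2 * toy_x[, 1] - 2 * toy_x[, 2]))

# Cross-validate over the lambda path, scored by misclassification rate
toy_cv <- cv.glmnet(toy_x, toy_y, family = 'binomial', type.measure = 'class')

# Predict with the lambda that minimised the cross-validated error
toy_prob <- predict(toy_cv, newx = toy_x, s = 'lambda.min', type = 'response')
toy_pred <- ifelse(toy_prob > 0.5, 1, 0)
print(mean(toy_pred == toy_y))
```

With a real DTM the same pattern would apply, with the training matrix and labels in place of toy_x and toy_y, and the test matrix passed as newx.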

3. Model Output

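Now let’s see how good the model is. We turn each predicted probability into a 1 or 0 using a 0.5 threshold, then compare the result against the true test labels: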

classifier_binary <- ifelse(
  classifier_probabilities > 0.5,
  1,
  0
)

classification_error <- mean(
  classifier_binary != classifier_label_test
)

classifier_accuracy <- 1 - classification_error
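
A single accuracy figure hides which class the model gets wrong. A confusion matrix, sketched here in base R on made-up predictions and labels, breaks the errors down by class; the toy_ vectors are hypothetical stand-ins for classifier_binary and classifier_label_test.

```r
# Made-up predictions and true labels (1 = baseball, 0 = space)
toy_predicted <- c(1, 1, 0, 0, 1, 0)
toy_actual    <- c(1, 0, 0, 1, 1, 0)

# Rows are predictions, columns are true labels
toy_confusion <- table(predicted = toy_predicted, actual = toy_actual)
print(toy_confusion)

# Accuracy is the share of predictions that match the true label
toy_accuracy <- mean(toy_predicted == toy_actual)
print(toy_accuracy)
```

In this toy case four of the six predictions are correct, with one error in each direction. On the real output, a lopsided table would suggest the model is mostly predicting a single class.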

So our final accuracy is 0.5929204, meaning the model labels only about 59% of the test documents correctly. That’s not very good!