How to Process Text Data With Quanteda

This document walks through the steps for processing text data into a dataframe that models can be built upon.

Step 3 - Transform Data

This is the most difficult part. With text data, you first need to turn the data into a corpus.

3a - Making a corpus

This is easy. Functions like corpus() or VCorpus() (from the tm package) take care of this for you. Which function you use depends on the library you’ll be using for the cleaning tasks. We’re using the quanteda package, so we will use the corpus() function.

# Convert text data into class character
raw_data$Message <- as.character(raw_data$Message)

# Stored as corpus
mycorp <- corpus(raw_data$Message)

Let’s see what we’ve got.

# See what data looks like - optional
mycorp[10]

##                                                                                                                                                       text10 
## "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"

OK, we’ve got the text data in the corpus format that quanteda expects.
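As a side note, here is a minimal sketch of building the corpus straight from a dataframe instead of from a single column. The toy_data object is a made-up stand-in for raw_data, and I am assuming a quanteda version where corpus() accepts a text_field argument. The point is that the other columns (like Category) ride along as document variables.

```r
library(quanteda)

# Hypothetical toy data standing in for raw_data
toy_data <- data.frame(
  Category = c("ham", "spam"),
  Message  = c("Ok lar, see you soon", "Call FREE on 0800 now"),
  stringsAsFactors = FALSE
)

# Build the corpus from the dataframe; Message supplies the texts,
# and the remaining columns are stored as document variables
toy_corp <- corpus(toy_data, text_field = "Message")

# Number of documents in the corpus
ndoc(toy_corp)

# Document variables that travel with the texts
docvars(toy_corp)
```

Keeping the target as a document variable is optional here, since this walkthrough appends Category by hand later on.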

3b - Tokenize the text

Next step is to turn the text data into tokens.

# Tokenize the corpus
mytokens <- tokens(mycorp, remove_punct = TRUE)

# See what data looks like - optional
mytokens[10]

## tokens from 1 document.
## text10 :
##  [1] "Had"         "your"        "mobile"      "11"          "months"     
##  [6] "or"          "more"        "U"           "R"           "entitled"   
## [11] "to"          "Update"      "to"          "the"         "latest"     
## [16] "colour"      "mobiles"     "with"        "camera"      "for"        
## [21] "Free"        "Call"        "The"         "Mobile"      "Update"     
## [26] "Co"          "FREE"        "on"          "08002986030"

3c - “Clean” the tokens

Then you need to do some “cleaning” of the tokens data by considering questions like:
* Do I want to stem the words?
* Do I want to disregard capitalization and punctuation?
* Do I want to remove stopwords or numbers?

Here is one way to do that.

# Word stem the tokens
newtokens <- tokens_wordstem(mytokens)

# See what data looks like - optional
newtokens[10]

## tokens from 1 document.
## text10 :
##  [1] "Had"         "your"        "mobil"       "11"          "month"      
##  [6] "or"          "more"        "U"           "R"           "entitl"     
## [11] "to"          "Update"      "to"          "the"         "latest"     
## [16] "colour"      "mobil"       "with"        "camera"      "for"        
## [21] "Free"        "Call"        "The"         "Mobil"       "Update"     
## [26] "Co"          "FREE"        "on"          "08002986030"

# Lowercase the tokens
newtokens2 <- tokens_tolower(newtokens)

# See what data looks like - optional
newtokens2[10]

## tokens from 1 document.
## text10 :
##  [1] "had"         "your"        "mobil"       "11"          "month"      
##  [6] "or"          "more"        "u"           "r"           "entitl"     
## [11] "to"          "update"      "to"          "the"         "latest"     
## [16] "colour"      "mobil"       "with"        "camera"      "for"        
## [21] "free"        "call"        "the"         "mobil"       "update"     
## [26] "co"          "free"        "on"          "08002986030"

In the version of quanteda used here, stopwords get removed inside the dfm() function (through its remove argument), so that happens in the next step.
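If you would rather handle stopwords and numbers at the token stage, here is a rough sketch of one way to do it. It is not what this walkthrough runs; the toy sentence and the toy_ names are made up for illustration. As far as I know, newer quanteda releases deprecate the remove argument of dfm() in favor of tokens_remove(), so this pattern may be the safer one on a recent install.

```r
library(quanteda)

# Tokenize a made-up message, dropping punctuation and number-only tokens
toy_tokens <- tokens("Call the Mobile Update Co FREE on 08002986030",
                     remove_punct = TRUE, remove_numbers = TRUE)

# Lowercase first so the tokens match the lowercase stopword list
toy_tokens <- tokens_tolower(toy_tokens)

# Remove English stopwords at the token level
toy_tokens <- tokens_remove(toy_tokens, stopwords("en"))

# Build the DFM from the already cleaned tokens
toy_dfm <- dfm(toy_tokens)
featnames(toy_dfm)
```

Note the order: lowercasing happens before the stopword removal so that a token like "The" is caught too.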

3d - Create a DFM

# Create DFM
mydfm <- dfm(newtokens2, remove = stopwords())

3e - Trim the DFM

Next, let’s remove sparse terms from the DFM. min_count sets the minimum number of times a word must occur across the whole DFM to be kept. min_docfreq sets the minimum number of different documents the word has to appear in.

# Trim DFM with dfm_trim() function
trimdfm <- dfm_trim(mydfm, min_count = 10, min_docfreq = 5)

# See what data looks like - optional
trimdfm[1:10, 1:15]

## Document-feature matrix of: 10 documents, 15 features (86.7% sparse).
## 10 x 15 sparse Matrix of class "dfm"
##         features
## docs     go point onli n great world e got wat ok lar u free entri 2
##   text1   1     1    1 1     1     1 1   1   1  0   0 0    0     0 0
##   text2   0     0    0 0     0     0 0   0   0  1   1 1    0     0 0
##   text3   0     0    0 0     0     0 0   0   0  0   0 0    1     2 1
##   text4   0     0    0 0     0     0 0   0   0  0   0 2    0     0 0
##   text5   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text6   0     0    0 0     0     0 0   0   0  1   0 0    0     0 0
##   text7   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text8   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text9   0     0    1 0     0     0 0   0   0  0   0 0    0     0 0
##   text10  0     0    0 0     0     0 0   0   0  0   0 1    2     0 0
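To make the two thresholds concrete, here is a tiny made-up example. I am assuming a recent quanteda release, where (to my knowledge) the min_count argument has been renamed min_termfreq; on the older version used above, min_count plays the same role. A word has to pass both tests to survive the trim.

```r
library(quanteda)

# Three made-up documents
toy_dfm <- dfm(tokens(c(d1 = "free free free call",
                        d2 = "free call now",
                        d3 = "ok")))

# Keep words that occur at least 2 times overall
# AND appear in at least 2 different documents
toy_trim <- dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)

# "free" (4 occurrences, 2 documents) and "call" (2 occurrences, 2 documents)
# pass both thresholds; "now" and "ok" occur only once and are dropped
featnames(toy_trim)
```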

3f - Turn DFM data into a data.frame

# Turn DFM into a dataframe
mydf <- data.frame(trimdfm)

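The first variable in this dataframe, document, only holds the document names, so it is useless for modeling. I also want to add my target variable to the dataframe and call it “Category”.

# Drop "document" variable (select() comes from the dplyr package)
mydf <- select(mydf, -document)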

# Append variables on to that dataframe
mydf2 <- cbind(raw_data$Category, mydf)

# Rename the raw_data$Category variable to "Category"
names(mydf2)[1] <- "Category"

Now I’ve got a dataframe of data that I can train a model on.
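Here is a hedged alternative sketch of the same step using quanteda's convert() function and base R only. The toy_ objects are made up, and this is not the code the walkthrough ran. Dropping the first column by position means the code does not depend on what that column is called, which I believe differs between quanteda versions ("document" in older ones, "doc_id" in newer ones).

```r
library(quanteda)

# Made-up DFM and target values, one label per document
toy_dfm <- dfm(tokens(c(d1 = "free call", d2 = "ok call")))
toy_labels <- c("spam", "ham")

# convert() returns a dataframe whose first column holds the document names
toy_df <- convert(toy_dfm, to = "data.frame")

# Drop the first column by position, then put the target in front
toy_df2 <- cbind(Category = toy_labels, toy_df[, -1, drop = FALSE])
names(toy_df2)
```

Naming the column inside cbind() also saves the separate renaming step.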

Let’s save this dataframe so that we can easily load it into other documents and do some model building.

# Save mydf2
getwd()
## [1] "P:/NS DMC 2018"
saveRDS(mydf2, "text_df_2.Rds")

Now the mydf2 dataframe has been saved onto my computer. Specifically, it got saved into my working directory. You can call getwd() right above the saveRDS() function to see your working directory, or call setwd() there to change it to the folder you want the file saved into.

Later, we can use readRDS() to pull this dataframe back into our environment and start building models on it.
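For reference, here is a small sketch of the full save and reload round trip in base R. The toy_df object and the file name are made up, and I write to a temporary folder with an explicit path just so the example does not depend on the working directory.

```r
# Hypothetical stand-in for mydf2
toy_df <- data.frame(Category = c("ham", "spam"), free = c(0, 2))

# Build an explicit path instead of relying on the working directory
rds_path <- file.path(tempdir(), "text_df_toy.Rds")
saveRDS(toy_df, rds_path)

# Later, possibly in another script or session
restored <- readRDS(rds_path)

# Check that the reloaded object matches the original
identical(toy_df, restored)
```

Unlike save() and load(), readRDS() hands back the object so you can assign it to whatever name you like.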


Billy Jackson, North Shore Community College

4/13/2018

Step 1 - Load Libraries