This document walks through the steps for processing text data into a dataframe that models can be built upon.
For processing the text data (no model building yet), I mainly need the quanteda package, plus dplyr for its select() function. I’ll also load rmarkdown and knitr in case I want to knit this document and/or use any special rmarkdown features.
# Load Libraries
library(quanteda)
library(dplyr)
library(rmarkdown)
library(knitr)
# Import raw data
setwd("P:/")
raw_data <- read.csv("SPAM text message 20170820 - Data.csv", nrows = 2000)
I loaded in only the first 2000 rows of data. This will make it less time-consuming to train my eventual models; when I find a model I like, I can rerun it on the entire dataset. I will transform the data first, then split the resulting clean data into training and testing sets.
Processing the text is the most difficult part of the workflow. With text data, you first need to turn the data into a corpus.
This first step is easy. Functions like corpus() (from quanteda) or VCorpus() (from tm) take care of it for you. Which function you use depends on the library that will do the cleaning tasks. We’re using the quanteda package, so we will use the corpus() function.
# Convert text data into class character
raw_data$Message <- as.character(raw_data$Message)
# Stored as corpus
mycorp <- corpus(raw_data$Message)
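As a side note, corpus() can also take the whole dataframe instead of just the text column. Here is a minimal sketch of that, assuming a made-up two-row stand-in for raw_data (toy_data and toy_corp are hypothetical names, not part of my actual run). The upside is that the Category label travels with each document as a document variable.

```r
library(quanteda)

# Hypothetical two-row stand-in for raw_data (made-up messages)
toy_data <- data.frame(
  Category = c("ham", "spam"),
  Message  = c("See you at lunch", "Win a FREE prize now"),
  stringsAsFactors = FALSE
)

# Build the corpus from the dataframe; text_field names the text column,
# and the remaining columns are kept as document variables
toy_corp <- corpus(toy_data, text_field = "Message")

# The label stays attached to each document
docvars(toy_corp, "Category")
```

I stick with the character-vector approach in the rest of this document, so this is only an alternative to be aware of.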
Let’s see what we’ve got.
# See what data looks like - optional
mycorp[10]
## text10
## "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
OK, we’ve got the text data in the corpus format that quanteda expects.
The next step is to turn the text data into tokens.
# Tokenize the corpus
mytokens <- tokens(mycorp, remove_punct = TRUE)
# See what data looks like - optional
mytokens[10]
## tokens from 1 document.
## text10 :
## [1] "Had" "your" "mobile" "11" "months"
## [6] "or" "more" "U" "R" "entitled"
## [11] "to" "Update" "to" "the" "latest"
## [16] "colour" "mobiles" "with" "camera" "for"
## [21] "Free" "Call" "The" "Mobile" "Update"
## [26] "Co" "FREE" "on" "08002986030"
Then you need to “clean” the tokens by considering questions like:
* Do I want to stem the words?
* Do I want to disregard capitalization and punctuation?
* Do I want to remove stopwords or numbers?
Here is one way to do that.
# Word stem the tokens
newtokens <- tokens_wordstem(mytokens)
# See what data looks like - optional
newtokens[10]
## tokens from 1 document.
## text10 :
## [1] "Had" "your" "mobil" "11" "month"
## [6] "or" "more" "U" "R" "entitl"
## [11] "to" "Update" "to" "the" "latest"
## [16] "colour" "mobil" "with" "camera" "for"
## [21] "Free" "Call" "The" "Mobil" "Update"
## [26] "Co" "FREE" "on" "08002986030"
# Lowercase the tokens
newtokens2 <- tokens_tolower(newtokens)
# See what data looks like - optional
newtokens2[10]
## tokens from 1 document.
## text10 :
## [1] "had" "your" "mobil" "11" "month"
## [6] "or" "more" "u" "r" "entitl"
## [11] "to" "update" "to" "the" "latest"
## [16] "colour" "mobil" "with" "camera" "for"
## [21] "free" "call" "the" "mobil" "update"
## [26] "co" "free" "on" "08002986030"
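Notice that none of the steps so far touch numbers, so tokens like "11" and "08002986030" are still in there. If you decide you don’t want them, one option is the remove_numbers argument of tokens(). This is only a sketch on a shortened version of the message above (toy_text, toy_tokens and toy_words are hypothetical names), and it is not something I did in my actual run.

```r
library(quanteda)

# Shortened version of the example message, used as a toy input
toy_text <- "Had your mobile 11 months or more? Call FREE on 08002986030"

# remove_numbers drops tokens that consist only of digits
toy_tokens <- tokens(toy_text, remove_punct = TRUE, remove_numbers = TRUE)

# Flatten to a plain character vector to inspect the remaining tokens
toy_words <- unlist(as.list(toy_tokens), use.names = FALSE)
toy_words
```

Whether to do this is a judgment call: for spam detection, the presence of a phone number might itself be a useful signal, which is one reason to leave numbers in.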
In the quanteda version used here, stopwords can be removed inside the dfm() function through its remove argument.
# Create DFM
mydfm <- dfm(newtokens2, remove = stopwords())
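Two caveats about that step. First, as far as I know, recent quanteda releases no longer accept the remove argument in dfm(), so the line above may fail or warn on a newer install. Second, because I stemmed before removing stopwords, a stemmed stopword like "onli" (from "only") no longer matches the stopword list and stays in. Here is a sketch of an alternative ordering that removes stopwords at the token level with tokens_remove() before stemming. The toy sentence and the toy_ names are hypothetical, and this is not what produced the output in this document.

```r
library(quanteda)

# Toy input, not from the real data
toy_tokens <- tokens("Call now to claim the free prize", remove_punct = TRUE)

# Lowercase first so the tokens match the lowercase stopword list
toy_tokens <- tokens_tolower(toy_tokens)

# Remove English stopwords while the words are still unstemmed
toy_tokens <- tokens_remove(toy_tokens, stopwords("en"))

# Stem what is left, then build the DFM with no extra arguments
toy_tokens <- tokens_wordstem(toy_tokens)
toy_dfm <- dfm(toy_tokens)

featnames(toy_dfm)
```

Swapping the order will change which features end up in the DFM, so the results further down would not match exactly if you go this route.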
Next, let’s remove sparse terms from the DFM. min_count sets the minimum number of times a word must occur across the whole DFM to be kept. min_docfreq sets the minimum number of different documents the word has to appear in. (Newer quanteda releases call the first argument min_termfreq instead of min_count.)
# Trim DFM with dfm_trim() function
trimdfm <- dfm_trim(mydfm, min_count = 10, min_docfreq = 5)
# See what data looks like - optional
trimdfm[1:10, 1:15]
## Document-feature matrix of: 10 documents, 15 features (86.7% sparse).
## 10 x 15 sparse Matrix of class "dfm"
##         features
## docs     go point onli n great world e got wat ok lar u free entri 2
##   text1   1     1    1 1     1     1 1   1   1  0   0 0    0     0 0
##   text2   0     0    0 0     0     0 0   0   0  1   1 1    0     0 0
##   text3   0     0    0 0     0     0 0   0   0  0   0 0    1     2 1
##   text4   0     0    0 0     0     0 0   0   0  0   0 2    0     0 0
##   text5   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text6   0     0    0 0     0     0 0   0   0  1   0 0    0     0 0
##   text7   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text8   0     0    0 0     0     0 0   0   0  0   0 0    0     0 0
##   text9   0     0    1 0     0     0 0   0   0  0   0 0    0     0 0
##   text10  0     0    0 0     0     0 0   0   0  0   0 1    2     0 0
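To make the two trimming thresholds concrete, here is a tiny sketch on three made-up documents. The toy_ names and the thresholds are hypothetical, and I use the newer argument name min_termfreq here, which I believe replaced min_count in later quanteda versions. A feature has to pass both thresholds to be kept.

```r
library(quanteda)

# Three made-up documents
toy_dfm <- dfm(tokens(c(
  d1 = "free free free call win win win",
  d2 = "free call now",
  d3 = "call me"
)))

# Keep features that occur at least 3 times overall
# AND appear in at least 2 different documents
toy_trim <- dfm_trim(toy_dfm, min_termfreq = 3, min_docfreq = 2)

featnames(toy_trim)
```

Here "win" occurs three times but only in one document, so it fails min_docfreq, while "now" and "me" fail min_termfreq. Only "free" and "call" should survive the trim.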
# Turn DFM into a dataframe
mydf <- data.frame(trimdfm)
The first variable in this dataframe, document, only holds the document names, so it is useless for modeling. I also want to append my target variable and call it “Category”.
# Drop "document" variable
mydf <- select(mydf, -document)
# Append variables on to that dataframe
mydf2 <- cbind(raw_data$Category, mydf)
# Rename the raw_data$Category variable to "Category"
names(mydf2)[1] <- "Category"
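If data.frame() on a DFM misbehaves on your install, quanteda also has a convert() function for this. The sketch below assumes a toy DFM and made-up labels (all toy_ names are hypothetical). As far as I know, the name of the first column differs between versions ("document" in older releases, "doc_id" in newer ones), so I drop it by position rather than by name.

```r
library(quanteda)

# Toy DFM built from two made-up documents
toy_dfm <- dfm(tokens(c(d1 = "free call", d2 = "call me")))

# Convert the DFM to a plain dataframe
toy_df <- convert(toy_dfm, to = "data.frame")

# The first column holds the document names; drop it by position
toy_df <- toy_df[, -1, drop = FALSE]

# Attach made-up labels as the first column, already named "Category"
toy_df <- cbind(Category = c("spam", "ham"), toy_df)

names(toy_df)
```

Naming the column inside cbind() also saves the separate renaming step.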
Now I’ve got a dataframe that I can train a model on.
Let’s save it so that we can easily load it into other documents and do some model building.
# Save mydf2
getwd()
## [1] "P:/NS DMC 2018"
saveRDS(mydf2, "text_df_2.Rds")
The mydf2 dataframe has now been saved onto my computer. Specifically, it got saved into my current working directory. You can call getwd() or setwd() right above the saveRDS() call to check or change the folder the file gets saved into.
Later, we can use readRDS() to pull this nice dataframe back into our environment and start building models on it.
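Here is a minimal sketch of that save-and-reload round trip, using a made-up dataframe and a temporary folder so that it doesn’t touch any real files (toy_df, rds_path and restored are hypothetical names).

```r
# Made-up stand-in for mydf2
toy_df <- data.frame(Category = c("ham", "spam"), free = c(0, 2))

# Write to a temporary folder instead of the working directory
rds_path <- file.path(tempdir(), "toy_text_df.Rds")
saveRDS(toy_df, rds_path)

# Read it back in and check that nothing changed
restored <- readRDS(rds_path)
identical(toy_df, restored)
```

Unlike write.csv(), saveRDS() keeps the column types as they were, so the reloaded dataframe is ready for modeling without any re-casting.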