Cleaning and Data Preparation
First, we will import, clean, and process the data for classification.
Datasets
I found a large library of datasets here. Below is an excerpt from the site describing these resources.
The email spam messages are collected from:
The ENRON email archive
The Apache Spam Assassin dataset
To make the work simpler, the two datasets are put into a single zip file here (107MB, contains about 60K files).
The SMS dataset is from: SMS data
For reference, here are the class labels by source:
SOURCES = [
  ('data/spam', SPAM),
  ('data/spam_2', SPAM),
  ('data/easy_ham_2', HAM),
  ('data/easy_ham', HAM),
  ('data/hard_ham', HAM),
  ('data/beck-s', HAM),
  ('data/farmer-d', HAM),
  ('data/kaminski-v', HAM),
  ('data/kitchen-l', HAM),
  ('data/lokay-m', HAM),
  ('data/williams-w3', HAM),
  ('data/BG', SPAM),
  ('data/GP', SPAM),
  ('data/SH', SPAM)
]
Read Files
I divided the files into two sets for spam and ham. There are ~67K files spread out over many folders, so I will load them locally and read in the files. The zip files containing the data are on GitHub.
Reproduction Note: If you would like to reproduce these files, switch the code chunk below from 'markdown' to 'r' and execute it. These are large files (too large to read directly from GitHub) with many subdirectories, so it will be faster and easier to unzip locally and read the files.
data_path <- paste0(getwd(), '/data')
# Download the zip to the working directory; ?raw=true makes GitHub
# serve the file itself rather than the repository page
download.file(
  url='https://github.com/Shampjeff/cuny_msds/blob/master/DATA_607/data/data.zip?raw=true',
  destfile='data.zip',
  method='curl'
)
unzip("data.zip")
ham_path<-paste0(data_path,'/ham')
spam_path<-paste0(data_path, '/spam')
Using local files for data extraction since they are already downloaded.
This function crawls through the directories and extracts the messages in their raw form.
make.data.frame <- function(path, class){
  # Dig through the directories for messages
  files <- list.files(path=path,
                      full.names=TRUE,
                      recursive=TRUE)
  # Read each file once the directories are crawled
  message <- lapply(files, function(x) read_file(x))
  # Add to a dataframe and assign the class label
  message <- unlist(message)
  data <- as.data.frame(message)
  data$class <- class
  return(data)
}
Make SPAM and HAM dataframes and bind them.
data<-make.data.frame(spam_path, class="SPAM")
data<-rbind(data, make.data.frame(ham_path, class="HAM"))
The SMS dataset can be bound as well. Again, please see the zip file on GitHub, or run the markdown chunk below as R, passing sms_path to the read_lines function, to download and extract it.
data_path <- paste0(getwd(), '/smsspamcollection')
# Same pattern as above: download the zip itself, then extract locally
download.file(
  url='https://github.com/Shampjeff/cuny_msds/blob/master/DATA_607/data/smsspamcollection.zip?raw=true',
  destfile='smsspamcollection.zip',
  method='curl'
)
unzip("smsspamcollection.zip")
sms_path <- paste0(data_path, '/SMSSpamCollection')
Using the local file since it is already downloaded.
sms_data <- as.data.frame(
  read_lines(sms_path)
)
names(sms_data)<-"lines"
sms_data<-sms_data %>%
separate(col = lines, into = c("class", "message"), sep = "\t") %>%
mutate(class = str_to_upper(class)) %>%
mutate(message = factor(message))
data<-rbind(data, sms_data)
Best of Text Messages
Let’s take a look at the messages and see what kind of divine prose our writers produce.
“Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer’s lease hath all too short a date;
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm’d;
And every fair from fair sometime declines,
By chance or nature’s changing course untrimm’d;
But thy eternal summer shall not fade,
Nor lose possession of that fair thou ow’st;
Nor shall death brag thou wander’st in his shade,
When in eternal lines to time thou grow’st:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.”
The above was not a text message from the SMS dataset. Our most verbose actor had the following, similar take on love…
best_of<-sms_data %>%
mutate(length = str_length(message)) %>%
arrange(desc(length))
str_split(best_of[1,'message'], "[.]")[[1]]
## [1] "For me the love should start with attraction"
## [2] "i should feel that I need her every time around me"
## [3] "she should be the first thing which comes in my thoughts"
## [4] "I would start the day and end it with her"
## [5] "she should be there every time I dream"
## [6] "love will be then when my every breath has her name"
## [7] "my life should happen around her"
## [8] "my life will be named to her"
## [9] "I would cry for her"
## [10] "will give all my happiness and take all her sorrows"
## [11] "I will be ready to fight with anyone for her"
## [12] "I will be in love when I will be doing the craziest things for her"
## [13] "love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet"
## [14] "I will always be singing praises for her"
## [15] "love will be when I start up making chicken curry and end up makiing sambar"
## [16] "life will be the most beautiful then"
## [17] "will get every morning and thank god for the day because she is with me"
## [18] "I would like to say a lot"
## [19] ""
## [20] "will tell later"
## [21] ""
## [22] ""
O Romeo, Romeo, wherefore art thou Romeo?
Text Clean Up
Some classifiers (like XGBoost) need the target class to be numerical; others (like Naive Bayes) do not. I'll make both so that we can try out different modeling methods.
data_spam<-data %>%
filter(class == "SPAM") %>%
mutate(target = 1)
data_ham<- data %>%
filter(class == "HAM") %>%
mutate(target = 0)
data<-rbind(data_spam, data_ham)
data$id <- 1:nrow(data)
DT::datatable(data %>%
count(class, target),
extensions = c('FixedColumns',"FixedHeader"),
options = list(scrollX = TRUE,
paging=TRUE,
fixedHeader=TRUE))
Now we need to clean the text files. Below we remove the HTML formatting, all punctuation, new lines, and digits from the text. We will also remove all stop words in the tidy lexicon. Maybe we can remove more later if the model isn't up to par.
data<-data %>%
mutate(message= str_remove_all(message, pattern = "<.*?>")) %>%
mutate(message= str_remove_all(message, pattern = "[:digit:]")) %>%
mutate(message= str_remove_all(message, pattern = "[:punct:]")) %>%
mutate(message= str_remove_all(message, pattern = "[\n]")) %>%
mutate(message= str_to_lower(message)) %>%
unnest_tokens(output=text,input=message,
token="paragraphs",
format="text") %>%
anti_join(stop_words, by=c("text"="word"))
These rows are currently ordered by class, one block after the other. We need to shuffle this order for the later train/test split.
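The shuffle itself is not shown in the chunks above; one minimal way to do it (the seed here is only illustrative):
# Shuffle the row order so SPAM and HAM are interleaved before splitting
set.seed(607)
data <- data[sample(nrow(data)), ]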
Document Term Matrix
We need to vectorize the words in the corpus of messages, and the tm package does a good job of handling that. We are also stemming the words while making the matrix, since it seemed easier to do here than in the tidyverse. The tm package offers a lot of really good functionality for this kind of work.
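The corpus construction is not shown above; a minimal sketch of how the cleaned text column could be wrapped into a tm corpus:
library(tm)
# Wrap the cleaned text column in a tm corpus for the DTM step below
text_corpus <- VCorpus(VectorSource(data$text))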
Now that the messages are in a corpus, we can produce a Document Term Matrix, which is a bag-of-words vectorizer for each message in the dataset. Without removing any sparse terms (below), this produces a huge matrix with hundreds of thousands of columns. Each column represents the count frequency of a word in the corpus.
# For tokens by message
text_dtm <- DocumentTermMatrix(text_corpus, control =
list(stemming = TRUE))
dim(text_dtm)
## [1] 67249 716207
We will reduce the number of columns to only the words that show up frequently (and are not stopwords). Reducing the dataset is very helpful in improving the predictive power of the model. Too many words “confuses” the algorithm and generally returns an overly simple model. For example, when I ran this without any term frequency filtering, every model just labeled all messages as SPAM.
We will also add a TFIDF score to each word to weight its relative frequency within each message. This is a great way to feature engineer without adding columns. Now the matrix shows word frequency by corpus and message.
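For reference, the weighting that weightTfIdf applies is, roughly, normalized term frequency times log inverse document frequency:
tfidf(t, d) = (count of t in d / total terms in d) * log2(N / number of documents containing t)
where N is the number of messages. A term that appears in nearly every message gets a weight near zero, while a term concentrated in a few messages gets boosted.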
# Filter out sparse terms
text_dtm <- removeSparseTerms(text_dtm,sparse = 0.92)
# Create TFIDF score
text_dtm <- as.data.frame(as.matrix(weightTfIdf(text_dtm)))
dim(text_dtm)
## [1] 67249 230
That is a big reduction in columns, which is good: it filters down to only the (most likely) important words.
Train, Test, Split
Normally, I would do 5-fold cross-validation to evaluate an ML model, but the spam/ham problem is well solved and understood. My guess is that the Naive Bayes classifier will probably do a sufficient job classifying these with a single split. This is a large dataset that takes time to process, and CV would only add to that computational load. We will do a 75/25 train/test split from the shuffled classes.
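The split can mirror the index math used in the XGBoost chunk further below; a minimal sketch, with the string class labels pulled out for Naive Bayes (variable names are illustrative):
# 75/25 train/test split over the shuffled rows
test_split <- round(0.25 * nrow(text_dtm))
test_text  <- text_dtm[1:test_split, ]
train_text <- text_dtm[(test_split + 1):nrow(text_dtm), ]
# String class labels for Naive Bayes; the XGBoost section swaps in
# the numeric target column instead
test_target  <- data$class[1:test_split]
train_target <- data$class[(test_split + 1):nrow(data)]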
Modeling
A few points on the modeling process.
- I tried several models, and Naive Bayes and XGBoost worked the best. Normally, when modeling, it is a good idea to try several options to determine the best performance and speed. I have a standard practice that I like to use, but it is in Python. See this repo for a Python package I built for model testing and evaluation; it comes with a vignette on how to use it. Having something to evaluate and store model results is very helpful for showing the progress of model development.
- The Naive Bayes classifier worked well, but still needed several testing iterations to get to 91% accuracy. Filtering the sparse terms was essential and brought the classifier from 52% accuracy to 78%. A little more sparse filtering and TFIDF scoring then pulled the classifier up to 91% accuracy.
- XGBoost is so good and very, very quick to run on this dataset. Much faster than Naive Bayes with better results. For direct comparison, I used the same train/test sets as for Naive Bayes, so the feature engineering needed for NB likely helped XGB as well.
Naive Bayes Classifier
The e1071 package has a Naive Bayes classifier. It needs the input training/test set to be a matrix and the target to be a vector. We can produce a confusion matrix to show the model's performance; that confusion matrix comes from the caret package.
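The fit itself is not shown above; a minimal sketch using the split from the previous section might look like this (nb_classifier is an illustrative name; preds feeds the confusion matrix below):
library(e1071)
# naiveBayes takes a numeric feature matrix and a factor of labels
nb_classifier <- naiveBayes(x = as.matrix(train_text),
                            y = as.factor(train_target))
# Predicted class labels ("SPAM"/"HAM") for the test set
preds <- predict(nb_classifier, newdata = as.matrix(test_text))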
# Create confusion matrix
confusionMatrix(data = preds, reference = as.factor(test_target),
positive = "SPAM", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction HAM SPAM
## HAM 6240 210
## SPAM 1303 9059
##
## Accuracy : 0.91
## 95% CI : (0.9056, 0.9143)
## No Information Rate : 0.5513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8156
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9773
## Specificity : 0.8273
## Pos Pred Value : 0.8743
## Neg Pred Value : 0.9674
## Prevalence : 0.5513
## Detection Rate : 0.5388
## Detection Prevalence : 0.6163
## Balanced Accuracy : 0.9023
##
## 'Positive' Class : SPAM
##
Awesome. There is some tricky spam in the SMS dataset. Without the SMS dataset, this classifier had 97% accuracy. Now let's look at XGB.
XGBoost
XGB is a powerful boosted-tree classifier that really sorted out the last few tricky SPAM/HAM messages.
test_split<-round(.25*dim(text_dtm)[1])
test_text<-text_dtm[1:test_split,]
train_text<-text_dtm[(test_split+1):dim(text_dtm)[1],]
# XGB needs numerical labels so we use the target column
test_target<-data$target[1:test_split]
train_target<-data$target[(test_split+1):dim(data)[1]]
# Vanilla parameters work pretty well in most cases
xgb <- xgboost(data = as.matrix(train_text),
label = as.vector(train_target),
max.depth = 7, eta = 1,
nthread = 2, nrounds = 2,
objective = "binary:logistic")
## [1] train-error:0.017864
## [2] train-error:0.015108
# Predict
xg_pred <- predict(xgb, as.matrix(test_text))
# Convert probabilities to binary
xg_pred<- ifelse(xg_pred >0.5, 1,0)
# Evaluate
confusionMatrix(data = factor(xg_pred, levels=c(1,0)),
reference = factor(test_target, levels=c(1,0)),
positive = "1", dnn = c("Prediction", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Prediction 1 0
## 1 9144 187
## 0 125 7356
##
## Accuracy : 0.9814
## 95% CI : (0.9793, 0.9834)
## No Information Rate : 0.5513
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9625
##
## Mcnemar's Test P-Value : 0.0005535
##
## Sensitivity : 0.9865
## Specificity : 0.9752
## Pos Pred Value : 0.9800
## Neg Pred Value : 0.9833
## Prevalence : 0.5513
## Detection Rate : 0.5439
## Detection Prevalence : 0.5550
## Balanced Accuracy : 0.9809
##
## 'Positive' Class : 1
##
Wow. So much better: down to ~300 wrongly classified messages out of ~16K in the test set. In addition to being a scary-good classifier, the xgboost package has some nice features for interpreting how the model works. Below is the feature importance plot showing the highest-ranking model features for SPAM detection.
importance_matrix <- xgb.importance(model = xgb)
xgb.plot.importance(importance_matrix = importance_matrix)
Here is the matrix of importance factors for the XGBoost model.
DT::datatable(importance_matrix,
extensions = c('FixedColumns',"FixedHeader"),
options = list(scrollX = TRUE,
paging=TRUE,
fixedHeader=TRUE,
pageLength = 10))
We see that common metadata tags have the largest importance in this model, with normal word stems rounding out the top 15 predictors. Even a really good spam message usually comes from a strange-looking address or has some irregular formatting.