Document Classification
Introduction
For Project 4 we will classify new “test” documents using already classified “training” documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam. For this project, we use a Kaggle spam/ham dataset to train models and then predict the class of new documents.
Spam Filter Data
I will be using the Kaggle Spam Filter dataset for this project. The dataset is a csv file containing spam and ham messages, with two attributes: ‘text’ and ‘spam’. The variable ‘spam’ is the target variable, taking the value 1 for spam and 0 for ham.
In this project, we will follow these steps:
- Load the dataset from GitHub
- Analyze the data
- Pre-process the text
- Apply models
- Draw conclusions
Load packages
Let’s load the packages we will need.
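The loading chunk is not echoed in this rendering; based on the startup messages below and the functions used later, it likely looked something like this (the exact package list is an assumption):
# data manipulation verbs
library(dplyr)
# getURL to fetch the csv over https
library(RCurl)
# text mining: corpus, tm_map, DocumentTermMatrix
library(tm)
# word stemming
library(SnowballC)
# confusionMatrix (also loads lattice and ggplot2)
library(caret)
# svm model
library(e1071)
# random forest model
library(randomForest)
# sample.split for the train/test split
library(caTools)
# word cloud visualization (loads RColorBrewer)
library(wordcloud)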
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: NLP
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Warning: package 'SnowballC' was built under R version 3.6.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## Loading required package: RColorBrewer
Load and analyze dataset
First we will load the dataset from GitHub. The dataset has 5,728 observations and 2 columns: text and the spam indicator.
#github URL
url <- getURL("https://raw.githubusercontent.com/amit-kapoor/data607/master/project4/emails.csv")
# Read csv from github
spamham_df <- read.csv(text = url, stringsAsFactors = FALSE)
# glimpse data
glimpse(spamham_df)
## Observations: 5,728
## Variables: 2
## $ text <chr> "Subject: naturally irresistible your corporate identity lt is …
## $ spam <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
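The dimensions below presumably come from an unechoed call like:
# check dimensions
dim(spamham_df)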
## [1] 5728 2
In the next couple of steps we will first look at the counts of the spam attribute values and then plot those counts.
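The counting chunk is not echoed; it was likely something along these lines (a sketch, assuming dplyr and ggplot2):
# count ham (0) vs spam (1)
spamham_df %>% count(spam)
# bar chart of the spam indicator counts (the plot itself is not reproduced here)
ggplot(spamham_df, aes(x = factor(spam))) + geom_bar()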
## # A tibble: 2 x 2
## spam n
## <int> <int>
## 1 0 4360
## 2 1 1368
Corpus and text pre-processing
The text attribute contains unstructured data with mixed upper/lower case, stop words, punctuation, numbers and so on. In this section we will address all of this by following the standard steps to build and pre-process the corpus.
- Build a new corpus variable called sh_corpus
- Using tm_map, convert the text to lowercase. This will make it uniform.
- Using tm_map, remove numbers.
- Using tm_map, remove all punctuation from the corpus.
- Using tm_map, remove all English stopwords from the corpus, as they don’t add any value.
- Using tm_map, stem the words in the corpus. Word stemming reduces words to their root form, unifying variants across documents.
Let’s first build the corpus for the text column in our dataset.
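The corpus-building chunk is not shown; a minimal sketch using tm’s VectorSource (an assumption):
# build corpus from the text column
sh_corpus <- Corpus(VectorSource(spamham_df$text))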
Now we will use tm_map transformation.
# lower case
sh_corpus <- tm_map(sh_corpus, tolower)
# remove numbers
sh_corpus <- tm_map(sh_corpus, removeNumbers)
# remove punctuation
sh_corpus <- tm_map(sh_corpus, removePunctuation)
# remove stop words
sh_corpus <- tm_map(sh_corpus, removeWords, stopwords())
# stem document
sh_corpus <- tm_map(sh_corpus, stemDocument)
Next we will create a Document Term Matrix. We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that returns a matrix where the rows correspond to documents and the columns correspond to words. The values in the matrix are the word frequencies in each document.
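The chunk creating the matrix is not echoed; presumably something like the following, where sh_dtm_full is a hypothetical name for the unfiltered matrix:
# create document term matrix from the corpus
sh_dtm_full <- DocumentTermMatrix(sh_corpus)
sh_dtm_full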
## <<DocumentTermMatrix (documents: 5728, terms: 25172)>>
## Non-/sparse entries: 462906/143722310
## Sparsity : 100%
## Maximal term length: 24
## Weighting : term frequency (tf)
We see a high sparsity percentage, so we will filter out the sparsest terms.
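The filtering chunk is not echoed; removeSparseTerms with a threshold near 0.97 would produce a matrix like the one below (the exact threshold is an assumption):
# keep terms present in at least ~3% of documents
sh_dtm <- removeSparseTerms(sh_dtm_full, 0.97)
sh_dtm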
## <<DocumentTermMatrix (documents: 5728, terms: 1429)>>
## Non-/sparse entries: 352457/7832855
## Sparsity : 96%
## Maximal term length: 13
## Weighting : term frequency (tf)
Now we’ll re-construct our dataset from the newly created DTM and add the target variable spam to it.
# reconstruct data frame from dtm
sh_finaldf <- as.data.frame(as.matrix(sh_dtm))
# add target variable
sh_finaldf$spam <- spamham_df$spam
# lets see dimension
dim(sh_finaldf)
## [1] 5728 1430
In the next couple of steps we will use word clouds to visualize the most frequent spam and ham words.
# spam word cloud
spam_indices <- which(spamham_df$spam == "1")
suppressWarnings(wordcloud(sh_corpus[spam_indices], min.freq = 50, max.words = 100, random.order = FALSE, random.color = TRUE, colors = palette()))
# ham word cloud
ham_indices <- which(spamham_df$spam == "0")
suppressWarnings(wordcloud(sh_corpus[ham_indices], min.freq = 50, max.words = 100, random.order = FALSE, random.color = TRUE, colors = palette()))
Train/Test Split and Applying Models
We will now use this data for our modeling. Before applying models we will split the dataset into a training set and a testing set. Our model(s) will learn from the training data, and we’ll evaluate them on the testing set. This is common practice to avoid overfitting. We’ll do a 75:25 train/test split using caTools.
set.seed(197)
# split on the target variable so class proportions are preserved
sample <- sample.split(sh_finaldf$spam, SplitRatio = 0.75)
train_df <- subset(sh_finaldf, sample == TRUE)
test_df <- subset(sh_finaldf, sample == FALSE)
To apply each model we will follow these steps:
- Initialize the model classifier with the training set and its target variable.
- Make predictions on the test set.
- Summarize the model.
- Check the results using a confusion matrix.
Support Vector Machine
We will start with the Support Vector Machine model. A Support Vector Machine (SVM) is a classifier defined by a separating hyperplane: given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new data. Let’s apply the steps defined above using svm.
# initialize svm
model_svm <- svm(train_df, as.factor(train_df$spam))
# make predictions using svm
predict_svm <- predict(model_svm, test_df)
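The model summary below presumably comes from an unechoed call:
# summarize svm model
summary(model_svm)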
##
## Call:
## svm.default(x = train_df, y = as.factor(train_df$spam))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 1562
##
## ( 501 1061 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
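The cross-tabulation below was likely produced along these lines (a reconstruction):
# cross-tabulate actual vs predicted classes
table(test_df$spam, predict_svm, dnn = c("Actual Class", "Predicted Class"))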
## Predicted Class
## Actual Class 0 1
## 0 1093 0
## 1 20 321
# svm confusion matrix
confusionMatrix(data = predict_svm, reference = as.factor(test_df$spam),
positive = "1", dnn = c("svm prediction", "svm actual"))
## Confusion Matrix and Statistics
##
## svm actual
## svm prediction 0 1
## 0 1093 20
## 1 0 321
##
## Accuracy : 0.9861
## 95% CI : (0.9785, 0.9915)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9607
##
## Mcnemar's Test P-Value : 2.152e-05
##
## Sensitivity : 0.9413
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9820
## Prevalence : 0.2378
## Detection Rate : 0.2238
## Detection Prevalence : 0.2238
## Balanced Accuracy : 0.9707
##
## 'Positive' Class : 1
##
Random Forest
Next we will use another model, Random Forest, to make predictions and compare it with SVM. It is an ensemble tree-based learning algorithm: the Random Forest classifier builds a set of decision trees from randomly selected subsets of the training data and aggregates their results to decide the final class of the test data.
# initialize random forest
model_rf <- randomForest(train_df, as.factor(train_df$spam))
# make predictions using random forest
predict_rf <- predict(model_rf, test_df)
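The component summary below presumably comes from an unechoed call:
# summarize random forest model
summary(model_rf)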
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 4294 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 8588 matrix numeric
## oob.times 4294 -none- numeric
## classes 2 -none- character
## importance 1430 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 4294 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
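As with SVM, the cross-tabulation below was likely produced along these lines (a reconstruction):
# cross-tabulate actual vs predicted classes
table(test_df$spam, predict_rf, dnn = c("Actual Class", "Predicted Class"))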
## Predicted Class
## Actual Class 0 1
## 0 1093 0
## 1 0 341
# rf confusion matrix
confusionMatrix(data = predict_rf, reference = as.factor(test_df$spam), positive = "1", dnn = c("rf prediction", "rf actual"))
## Confusion Matrix and Statistics
##
## rf actual
## rf prediction 0 1
## 0 1093 0
## 1 0 341
##
## Accuracy : 1
## 95% CI : (0.9974, 1)
## No Information Rate : 0.7622
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.2378
## Detection Rate : 0.2378
## Detection Prevalence : 0.2378
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##
Conclusion
Looking at the models above, trained on the training set and evaluated on the test set, the Random Forest model (100% test accuracy) outperformed the SVM model (~98.6% test accuracy). Random Forest did, however, take more time to run than SVM. We could also use hyperparameter tuning to improve model performance; a sketch follows below.
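As an illustration of what tuning could look like with e1071 (the parameter grid is illustrative only, and the target column is excluded from the features):
# grid search over svm cost and gamma with cross validation
tuned <- tune.svm(x = subset(train_df, select = -spam),
                  y = as.factor(train_df$spam),
                  cost = c(0.1, 1, 10), gamma = c(0.001, 0.01, 0.1))
summary(tuned)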
Recommendation
We could run a few more advanced models and compare results for further improvements. We could also further refine our corpus by reducing data sparsity.