Classifying Documents in the Reuters-21578 R8 Dataset

The Data

The data used in this text mining application is the Reuters-21578 R8 dataset (all terms). It is obtainable from here. Reuters, Ltd. is an international news agency headquartered in London and is a division of Thomson Reuters.

It is one of the most widely used test collections for text categorization research. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.

Loading Libraries

library(tm) # text mining
library(caret) # for machine learning

Loading the Data

Note that the data were copied from the source website and pasted into text files (r8-train-all-terms.txt and r8-test-all-terms.txt), which were then saved to my local directory. The data are loaded into R data frames where rows represent documents. The “Class” column contains the label for the document’s topic (for supervised learning purposes) and the “docText” column contains the raw text of the document. The train dataset contains 5485 documents and the test dataset contains 2189 documents; the train/test split has already been done for us. The train and test data are then merged so that they form one corpus and receive exactly the same preprocessing, which is necessary for supervised learning later on. A variable “train_test” is created to record whether a document belongs to the train or test dataset, so that the data can be split again afterwards.

Each split of this dataset comes conveniently packaged in a single tab-delimited .txt file. Normally, one is forced to deal with individual .txt documents.

The data contains documents whose classes belong to 8 of the 10 most frequent document classes of Reuters’ articles. For the sake of computational expense/memory, the data is subset to documents belonging to only 3 of these classes.

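# load the data
# stringsAsFactors=TRUE keeps both columns as factors (the default before R 4.0),
# which the str() output and the droplevels() call below rely on
r8train <- read.table("r8-train-all-terms.txt", header=FALSE, sep='\t', stringsAsFactors=TRUE)
r8test <- read.table("r8-test-all-terms.txt", header=FALSE, sep='\t', stringsAsFactors=TRUE)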

# explore the structure of the data
str(r8train)

## 'data.frame':    5485 obs. of  2 variables:
##  $ V1: Factor w/ 8 levels "acq","crude",..: 3 1 3 3 3 3 3 3 3 3 ...
##  $ V2: Factor w/ 5423 levels "a and p gap sets higher capital spending the great atlantic and pacific tea co said its three year mln dlr capital program will"| __truncated__,..: 964 1184 1105 143 739 1441 737 1744 5078 3649 ...

str(r8test)

## 'data.frame':    2189 obs. of  2 variables:
##  $ V1: Factor w/ 8 levels "acq","crude",..: 8 4 7 1 3 3 1 3 3 3 ...
##  $ V2: Factor w/ 2176 levels "a g edwards inc age st qtr may net shr cts vs cts net vs revs mln vs mln avg shrs vs reuter ",..: 124 381 144 1835 64 251 501 250 396 2080 ...

# rename variables
names(r8train) <- c("Class", "docText")
names(r8test) <- c("Class", "docText")

# convert the document text variable to character type
r8train$docText <- as.character(r8train$docText)
r8test$docText <- as.character(r8test$docText)

# create variable to denote if observation is train or test
r8train$train_test <- c("train")
r8test$train_test <- c("test")

# merge the train/test data
merged <- rbind(r8train, r8test)

# remove objects that are no longer needed 
remove(r8train, r8test)

# subset to 3 document classes only for sake of computational expense/memory
# not doing so will result in a stack overflow or long computational times for ML algorithms
merged <- merged[which(merged$Class %in% c("crude","money-fx","trade")),]

# drop unused levels in the response variable
merged$Class <- droplevels(merged$Class) 

# counts of each class in the train/test sets
table(merged$Class,merged$train_test)

##           
##            test train
##   crude     121   253
##   money-fx   87   206
##   trade      75   251

The quality of the tagged dataset is by far the most important component of a text classifier. The dataset needs to be large enough to have an adequate number of documents in each class. The amount of data you need varies depending on the application and objectives. Here, there is still a sufficient amount of data after subsetting the overall data down to just the crude, money-fx, and trade documents.

In general, the dataset also needs to be of a high enough quality in terms of how distinct the documents in the different categories are from each other to allow clear delineation between the categories.

Pre-processing

I begin by creating a corpus (a large, structured collection of text documents) of all train/test documents. Raw data is rarely ready for analysis. Several common text pre-processing tasks are performed:

remove punctuation
remove digits
remove extra white space
remove stop words (e.g. the, and, is, for)
conversion to lower case

The tm_map function applies these cleaning tasks to the entire corpus. The other major pre-processing component is creating the term document matrix (TDM), which contains the frequency of terms in each document. Here, rows of the TDM represent documents and each column represents a unique word in the corpus, so the (i,j)th entry contains the number of times “word j” appeared in “document i”. (Strictly speaking, this orientation is a document-term matrix, which is why the code calls DocumentTermMatrix; the text refers to it as the TDM throughout.) By representing the corpus this way, we open the door for machine learning techniques to be used. In particular, algorithms such as kNN, Naive Bayes, and SVM require that the data be represented in the form of a table.
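
To make the structure of this matrix concrete, here is a minimal base R sketch that builds the same kind of count table by hand. The three toy documents and the names docs, vocab, and toy_dtm are made up for illustration; this is not how tm builds its sparse matrix internally.

```r
# Toy corpus: three already-cleaned documents (hypothetical text)
docs <- c("oil prices rise", "trade deficit widens", "oil trade talks")

# Split each document into words on whitespace
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of unique words across all documents
vocab <- sort(unique(unlist(tokens)))

# For each document, count how often each vocabulary word occurs
counts <- vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# vapply returns one column per document, so transpose: rows = documents, columns = terms
toy_dtm <- t(counts)
dimnames(toy_dtm) <- list(paste0("doc", seq_along(docs)), vocab)

toy_dtm
```

Each row sums to the number of words in that document, and most entries are zero, which is why real document-term matrices are stored in a sparse format.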

It is common to produce a weighted version of the TDM by term frequency - inverse document frequency (tf-idf). This weighting takes into account how often a term is used in the entire corpus as well as in a single document. The logic is that if a term is used frequently throughout the corpus, it is probably not as important for differentiating documents. Conversely, if a word appears rarely in the corpus, it may be an important differentiator even if it occurs only a few times in a document. In this analysis, both the unweighted and tf-idf weighted TDMs are computed. The performance of various supervised learning techniques using both versions of the TDM is compared later on.
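
The weighting can be sketched in a few lines of base R. The version below assumes a length-normalized term frequency and a base-2 logarithm for the inverse document frequency, which is my understanding of the default behaviour of tm's weightTfIdf; treat that correspondence as an assumption. The toy counts are made up.

```r
# Toy document-term counts (hypothetical): rows are documents, columns are terms
counts <- matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("said", "oil", "trade")))

# Term frequency: each count divided by the total number of terms in its document
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / number of docs containing the term)
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# Multiply every column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, "*")

round(tfidf, 3)
```

In this toy example "said" occurs in every document, so its idf is log2(1) = 0 and its whole column is weighted down to zero, while the rarer terms "oil" and "trade" keep positive weights.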

Frequent terms can also be found (those which occur at least a specified number of times across the corpus). Lastly, I convert the TDMs into data frames for modeling purposes, split them back into the original train/test sets, and append the document labels as the last column in these data frames (using the variable “doc.class”).

# a vector source interprets each element of the vector as a document
sourceData <- VectorSource(merged$docText)

# create the corpus
corpus <- Corpus(sourceData)

# example document before pre-processing
corpus[[20]]$content

## [1] "saudi arabia reiterates commitment to opec accord saudi arabian oil minister hisham nazer reiterated the kingdom s commitment to last december s opec accord to boost world oil prices and stabilize the market the official saudi press agency spa said asked by the agency about the recent fall in free market oil prices nazer said saudi arabia is fully adhering by the accord and it will never sell its oil at prices below the pronounced prices under any circumstance saudi arabia was a main architect of december pact under which opec agreed to cut its total oil output ceiling by pct and return to fixed prices of around dollars a barrel reuter "

# preprocess/clean the merged (train + test) corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeNumbers) # remove digits
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # strip extra whitespace
corpus <- tm_map(corpus, removeWords, stopwords('english')) # remove stopwords

# example document after pre-processing
corpus[[20]]$content

## [1] "saudi arabia reiterates commitment  opec accord saudi arabian oil minister hisham nazer reiterated  kingdom s commitment  last december s opec accord  boost world oil prices  stabilize  market  official saudi press agency spa said asked   agency   recent fall  free market oil prices nazer said saudi arabia  fully adhering   accord   will never sell  oil  prices   pronounced prices   circumstance saudi arabia   main architect  december pact   opec agreed  cut  total oil output ceiling  pct  return  fixed prices  around dollars  barrel reuter "

# create term document matrix (tdm)
tdm <- DocumentTermMatrix(corpus)

# inspecting the tdm
dim(tdm) # 993 documents, 9243 terms

## [1]  993 9243

colnames(tdm)[200:210] # sample of columns (words)

##  [1] "african"    "aftermath"  "afternoon"  "afterwards" "agcny"     
##  [6] "aged"       "ageing"     "agencies"   "agency"     "agenda"    
## [11] "agent"

as.matrix(tdm)[10:20,200:210] # inspect a portion of the tdm

##     Terms
## Docs african aftermath afternoon afterwards agcny aged ageing agencies
##   10       0         0         0          0     0    0      0        0
##   11       0         0         0          0     0    0      0        0
##   12       0         0         0          0     0    0      0        0
##   13       0         0         0          0     0    0      0        0
##   14       0         0         0          0     0    0      0        0
##   15       0         0         0          0     0    0      0        0
##   16       0         0         0          0     0    0      0        0
##   17       1         0         0          0     0    0      0        0
##   18       0         0         0          0     0    0      0        0
##   19       0         0         0          0     0    0      0        0
##   20       0         0         0          0     0    0      0        0
##     Terms
## Docs agency agenda agent
##   10      0      0     0
##   11      0      0     0
##   12      0      0     0
##   13      2      0     0
##   14      0      0     0
##   15      0      0     0
##   16      0      0     0
##   17      0      0     0
##   18      0      0     0
##   19      1      0     0
##   20      2      0     0

# create tf-idf weighted version of term document matrix
weightedtdm <- weightTfIdf(tdm)
as.matrix(weightedtdm)[10:20,200:210] # inspect same portion of the weighted tdm

##     Terms
## Docs   african aftermath afternoon afterwards agcny aged ageing agencies
##   10 0.0000000         0         0          0     0    0      0        0
##   11 0.0000000         0         0          0     0    0      0        0
##   12 0.0000000         0         0          0     0    0      0        0
##   13 0.0000000         0         0          0     0    0      0        0
##   14 0.0000000         0         0          0     0    0      0        0
##   15 0.0000000         0         0          0     0    0      0        0
##   16 0.0000000         0         0          0     0    0      0        0
##   17 0.1276481         0         0          0     0    0      0        0
##   18 0.0000000         0         0          0     0    0      0        0
##   19 0.0000000         0         0          0     0    0      0        0
##   20 0.0000000         0         0          0     0    0      0        0
##     Terms
## Docs     agency agenda agent
##   10 0.00000000      0     0
##   11 0.00000000      0     0
##   12 0.00000000      0     0
##   13 0.04114098      0     0
##   14 0.00000000      0     0
##   15 0.00000000      0     0
##   16 0.00000000      0     0
##   17 0.00000000      0     0
##   18 0.00000000      0     0
##   19 0.06114790      0     0
##   20 0.11903458      0     0

# find frequent terms: terms that occur at least 250 times in total across the corpus
findFreqTerms(tdm, 250)

##  [1] "agreement"  "also"       "bank"       "barrel"     "billion"   
##  [6] "bpd"        "countries"  "crude"      "deficit"    "dlrs"      
## [11] "dollar"     "economic"   "exchange"   "exports"    "february"  
## [16] "foreign"    "government" "imports"    "industry"   "japan"     
## [21] "japanese"   "last"       "market"     "markets"    "meeting"   
## [26] "minister"   "mln"        "new"        "official"   "officials" 
## [31] "oil"        "one"        "opec"       "pct"        "price"     
## [36] "prices"     "production" "reuter"     "said"       "states"    
## [41] "stg"        "today"      "told"       "trade"      "two"       
## [46] "united"     "week"       "west"       "will"       "world"     
## [51] "year"

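# convert tdm's into data frames
# as.matrix() gives the full dense matrix without printing it; inspect() is intended for display
tdm <- as.data.frame(as.matrix(tdm))
weightedtdm <- as.data.frame(as.matrix(weightedtdm))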

# split back into train and test sets
tdmTrain <- tdm[which(merged$train_test == "train"),]
weightedTDMtrain <- weightedtdm[which(merged$train_test == "train"),]
  
tdmTest <-  tdm[which(merged$train_test == "test"),]
weightedTDMtest <- weightedtdm[which(merged$train_test == "test"),]

# remove objects that are no longer needed to conserve memory
remove(tdm,weightedtdm)

# append document labels as last column
tdmTrain$doc.class <- merged$Class[which(merged$train_test == "train")]
tdmTest$doc.class <- merged$Class[which(merged$train_test == "test")]
weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")]
weightedTDMtest$doc.class  <- merged$Class[which(merged$train_test == "test")]

Supervised Analysis: Document Classification

Remember, these methods are only available when the training data is tagged/labeled. In practice, people most often have to tag their own data (or hire poor grad students) if accuracy is important.

Repeated 10-fold cross-validation is also used. That is, the training data is split into 10 subsets (folds). Each algorithm is then fit 10 times, each time holding out a different fold for validation and training on the remaining 9. This process is repeated 3 separate times, and the accuracy is averaged across all 30 held-out folds to give a more stable estimate of out-of-sample performance, which is used to select tuning parameters and helps guard against overfitting. Notice that, before each call to train, I set the random number seed. That has the effect of using the same resampling data sets for all models, which is necessary for comparing models.
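
As a rough illustration of the resampling scheme, the base R sketch below generates the train/validation index sets for 10 folds repeated 3 times. It assigns folds purely at random, whereas caret additionally tries to balance the classes within folds, so this is a simplified stand-in rather than caret's procedure. The sample size n and the names fold_ids and splits are illustrative.

```r
set.seed(100)
n <- 50        # number of training documents (toy value)
k <- 10        # folds per repetition
repeats <- 3   # number of repetitions

# For each repetition, randomly assign every document to one of k folds
# (result: an n x repeats matrix of fold numbers)
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Each (repetition, fold) pair defines one held-out set; the rest is used for fitting
splits <- list()
for (r in seq_len(repeats)) {
  for (f in seq_len(k)) {
    held_out <- which(fold_ids[, r] == f)
    splits[[length(splits) + 1]] <- list(
      train = setdiff(seq_len(n), held_out),
      validate = held_out
    )
  }
}

length(splits)  # number of model fits per candidate tuning value
```

Every document is held out exactly once per repetition, so each candidate tuning value is scored on 30 validation sets in total.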

Currently, 3 models are developed and compared. They are:

K-Nearest Neighbors (kNN)
Support Vector Machine (SVM)
Decision Tree

Naive Bayes was also considered but due to its poor performance (approximately 37% accuracy for the unweighted TDM and 49% for the weighted), it is not discussed here.

Fitting the Models / Prediction

K-Nearest Neighbors (kNN)

kNN predicts the class of a new document by finding the k most similar documents (neighbors) in the training data and assigning the class that is most common among those k neighbors. Alternatively, the classes of the neighbors are weighted using the similarity of each neighbor to the new document, where similarity is measured by Euclidean distance or the cosine similarity between two document vectors.

The assumption of kNN is that the classification of a new document is most similar to the classification of other documents that are “nearby” in the vector space.

kNN doesn’t rely on prior probabilities and requires essentially no training. The main computation happens at prediction time: computing and sorting the distances to the training documents in order to find the k nearest neighbors of the new document. An alternative to kNN is hierarchical clustering if there’s believed to be a hierarchical structure to the data’s categories/labels.
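
The unweighted majority-vote version of the idea fits in a few lines of base R. The helper knn_predict and the two-term toy data below are hypothetical and only illustrate the mechanics; the models in this analysis are fit through caret.

```r
# Predict the class of one new document by majority vote among its k nearest
# training documents, using Euclidean distance (hypothetical helper)
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every row of train_x
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training documents
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most common label among the neighbors (ties go to the first label in table order)
  names(which.max(table(nearest)))
}

# Toy training data: term counts for two terms in four documents
train_x <- matrix(c(3, 0,
                    2, 1,
                    0, 3,
                    1, 2),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("oil", "trade")))
train_y <- c("crude", "crude", "trade", "trade")

knn_predict(train_x, train_y, new_x = c(oil = 3, trade = 1), k = 3)
```

Here the three nearest neighbors of the new document are two "crude" documents and one "trade" document, so the vote goes to "crude".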

# set resampling scheme
ctrl <- trainControl(method="repeatedcv",number = 10, repeats = 3) #,classProbs=TRUE)

# fit a kNN model using the weighted (tf-idf) term document matrix
# tuning parameter: K
set.seed(100)
knn.tfidf <- train(doc.class ~ ., data = weightedTDMtrain, method = "knn", trControl = ctrl) #, tuneLength = 20)

# predict on test data
knn.tfidf.predict <- predict(knn.tfidf, newdata = weightedTDMtest)

##################################################################
# ---------------------------------------------------------------
##################################################################

# fit a kNN model using the unweighted TDM
# tuning parameter: K
set.seed(100)
knn <- train(doc.class ~ ., data = tdmTrain, method = "knn", trControl = ctrl) #, tuneLength = 20)

# predict on test data
knn.predict <- predict(knn, newdata = tdmTest)

Support Vector Machine (SVM)

From Wikipedia - “Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.” The main task of an SVM is finding the optimal hyperplane with maximum distance from the nearest training patterns which are the support vectors. Only the support vectors are used for prediction/categorization which makes it computationally efficient.

cons of SVM:

Parameters of a solved SVM model are difficult to interpret
pre-processing the data is important
cross-validation is important (debatable if this is a con)

pros of SVM:

they are relatively fast and great for text categorization.

important: An SVM is only directly applicable to two-class tasks. If the variable of interest has more than two classes, a multiclass implementation must be used. In this analysis, there are more than two classes to classify and thus I am implementing a multiclass SVM. The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems. Common methods for such reduction include:

Building binary classifiers which distinguish (i) between one of the labels and the rest (one-versus-all) or (ii) between every pair of classes (one-versus-one). Classification of new instances for the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class (it is important that the output functions be calibrated to produce comparable scores). For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of the two classes, then the vote for the assigned class is increased by one vote, and finally the class with the most votes determines the instance classification.
Directed acyclic graph SVM (DAGSVM)
Error-correcting output codes
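
The one-versus-one voting step can be sketched in base R as follows. The pairwise winners in pair_winner are made up for a single new document; in a real model each would come from one fitted binary SVM.

```r
classes <- c("crude", "money-fx", "trade")

# All unordered pairs of classes: one binary classifier is trained per pair
pairs <- combn(classes, 2, simplify = FALSE)

# Hypothetical winner of each pairwise classifier for one new document,
# in the same order as `pairs`:
#   crude vs money-fx -> crude, crude vs trade -> trade, money-fx vs trade -> trade
pair_winner <- c("crude", "trade", "trade")

# Tally one vote per pairwise decision and pick the class with the most votes
votes <- table(factor(pair_winner, levels = classes))
predicted <- names(votes)[which.max(votes)]

predicted
```

With 3 classes there are only 3 pairwise classifiers, but the number grows as K(K-1)/2 for K classes, which is one reason one-versus-all is sometimes preferred when K is large.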

Here, I fit SVMs with the caret package and compare the performance of linear and radial basis function kernels.

# set resampling scheme: 10-fold cross-validation, 3 times
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 3) #,classProbs=TRUE)

# fit a multiclass SVM using the weighted (tf-idf) term document matrix
# kernel: linear 
# tuning parameters: C 
set.seed(100)
svm.tfidf.linear  <- train(doc.class ~ . , data=weightedTDMtrain, trControl = ctrl, method = "svmLinear")

# fit another multiclass SVM using the weighted (tf-idf) term document matrix
# kernel: radial basis function
# tuning parameters: sigma, C 
set.seed(100)
svm.tfidf.radial  <- train(doc.class ~ . , data=weightedTDMtrain, trControl = ctrl, method = "svmRadial")

# predict on test data
svm.tfidf.linear.predict <- predict(svm.tfidf.linear,newdata = weightedTDMtest)
svm.tfidf.radial.predict <- predict(svm.tfidf.radial,newdata = weightedTDMtest)


##################################################################
# ---------------------------------------------------------------
##################################################################

# fit a multiclass SVM using the unweighted TDM
# kernel: linear 
# tuning parameters: C 
set.seed(100)
svm.linear  <- train(doc.class ~ . , data=tdmTrain, trControl = ctrl, method = "svmLinear")

# fit another multiclass SVM using the unweighted TDM
# kernel: radial basis function
# tuning parameters: sigma, C 
set.seed(100)
svm.radial  <- train(doc.class ~ . , data=tdmTrain, trControl = ctrl, method = "svmRadial")

# predict on test data
svm.linear.predict <- predict(svm.linear,newdata = tdmTest)
svm.radial.predict <- predict(svm.radial,newdata = tdmTest)

Decision Tree

From Wikipedia - “A decision tree is a flowchart-like structure in which each internal node represents a ‘test’ on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represents classification rules.”

Decision trees work by iteratively finding the variable (in a group of variables) that best splits the outcome into two groups. The recursion stops when all groups are sufficiently small/homogeneous/pure. This produces a non-linear model, and the classification tree uses interactions between variables. Usually, the ultimate group/leaf will depend on many variables. Thus, decision trees can be viewed as predictive models which map observations about an item to conclusions about the item’s target value.

Different algorithms use different metrics for measuring the “best” split. Generally, they measure homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g. by averaging) to provide a measure of the quality of the split. Some of the metrics are:

impurity
misclassification error
gini index
information gain
deviance
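
Two of these metrics, the Gini index and information gain, can be computed by hand as in the base R sketch below. The functions gini, entropy, and split_gain and the toy labels are illustrative; rpart's internal implementation is not shown here.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Entropy in bits; classes with zero count are dropped to avoid log2(0)
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Quality of a binary split: parent impurity minus the size-weighted child impurity
split_gain <- function(labels, goes_left, impurity = gini) {
  left <- labels[goes_left]
  right <- labels[!goes_left]
  weighted <- (length(left) * impurity(left) +
               length(right) * impurity(right)) / length(labels)
  impurity(labels) - weighted
}

labels <- c("crude", "crude", "crude", "trade", "trade", "trade")
contains_opec <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # hypothetical term indicator

split_gain(labels, contains_opec)                      # decrease in Gini impurity
split_gain(labels, contains_opec, impurity = entropy)  # information gain
```

In this toy case the split separates the two classes perfectly, so both child nodes are pure and the gain equals the full parent impurity (0.5 for Gini, 1 bit for entropy).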

pros of decision trees:

easy to interpret
can handle non-linear relationships well
transforming the input features is less important (monotonic transformations produce the same splits)
can be used for regression as well

cons of decision trees:

prone to overfitting, especially with a large number of variables (must use proper cross-validation)
harder to estimate uncertainty of the model

note: the data frames are wrapped by data.frame() in order to ensure syntactically valid variable names. rpart requires this and if it is not done errors will occur.

# set resampling scheme: 10-fold cross-validation, 3 times
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 3) #,classProbs=TRUE)

# fit a decision tree using the weighted (tf-idf) term document matrix
# tuning parameter: cp
set.seed(100)
tree.tfidf  <- train(doc.class ~ . , data = data.frame(weightedTDMtrain), method = "rpart",trControl = ctrl )

# predict on test data
tree.tfidf.predict <- predict(tree.tfidf,newdata = data.frame(weightedTDMtest))

##################################################################
# ---------------------------------------------------------------
##################################################################


# fit a decision tree using the unweighted TDM
# tuning parameter: cp
set.seed(100)
tree <- train(doc.class ~ . , data = data.frame(tdmTrain), method = "rpart", trControl = ctrl) 

# predict on test data
tree.predict <- predict(tree,newdata = data.frame(tdmTest))

Evaluating the Models

K-Nearest Neighbors (kNN)

# --------------- Weighted (tf-idf) TDM ---------------------#
# Output of kNN fit 
knn.tfidf

## k-Nearest Neighbors 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8554163  0.7821269
##   7  0.8488569  0.7718249
##   9  0.8252937  0.7359511
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 5.

# plot accuracy vs. number of neighbors
plot(knn.tfidf)

# --------------- Unweighted TDM ---------------------#
# output the kNN fit
knn

## k-Nearest Neighbors 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9182156  0.8771093
##   7  0.9177392  0.8764557
##   9  0.9196563  0.8794357
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 9.

# plot accuracy vs. number of neighbors
plot(knn)

# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(knn.predict, tdmTest$doc.class)  # unweighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      113        0     0
##   money-fx     7       83     5
##   trade        1        4    70
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9399          
##                  95% CI : (0.9056, 0.9646)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9084          
##  Mcnemar's Test P-Value : 0.04377         
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9339          0.9540       0.9333
## Specificity                1.0000          0.9388       0.9760
## Pos Pred Value             1.0000          0.8737       0.9333
## Neg Pred Value             0.9529          0.9787       0.9760
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.3993          0.2933       0.2473
## Detection Prevalence       0.3993          0.3357       0.2650
## Balanced Accuracy          0.9669          0.9464       0.9546

confusionMatrix(knn.tfidf.predict, weightedTDMtest$doc.class)  # weighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      103        0     2
##   money-fx     1       78     2
##   trade       17        9    71
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8905          
##                  95% CI : (0.8481, 0.9243)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8347          
##  Mcnemar's Test P-Value : 0.0006141       
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.8512          0.8966       0.9467
## Specificity                0.9877          0.9847       0.8750
## Pos Pred Value             0.9810          0.9630       0.7320
## Neg Pred Value             0.8989          0.9554       0.9785
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.3640          0.2756       0.2509
## Detection Prevalence       0.3710          0.2862       0.3428
## Balanced Accuracy          0.9194          0.9406       0.9108

# print info about parameters, etc. used in the model with highest accuracy
knn$results # error rate and values of tuning parameter

##   k  Accuracy     Kappa AccuracySD    KappaSD
## 1 5 0.9182156 0.8771093 0.03277920 0.04908560
## 2 7 0.9177392 0.8764557 0.03702300 0.05547952
## 3 9 0.9196563 0.8794357 0.03783284 0.05656316

knn$bestTune # final tuning parameter

##   k
## 3 9

knn$metric # metric used to select optimal model

## [1] "Accuracy"

knn$times # a list of execution times

## $everything
##    user  system elapsed 
##  453.83    1.83  455.88 
## 
## $final
##    user  system elapsed 
##    0.12    0.00    0.13 
## 
## $prediction
## [1] NA NA NA

The model that does NOT use the weighted term document matrix (that is, the one that uses just raw frequency counts in the term document matrix) had higher accuracy.
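
The headline numbers in the confusionMatrix output can be reproduced by hand, which is a useful check on what they mean. The sketch below re-enters the test-set confusion matrix of the unweighted kNN model from the output above and recomputes accuracy, sensitivity, and positive predictive value in base R; the variable names are illustrative.

```r
# Test-set confusion matrix for the unweighted kNN model, copied from the output above
# (rows are predictions, columns are the true classes)
cm <- matrix(c(113,  0,  0,
                 7, 83,  5,
                 1,  4, 70),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("crude", "money-fx", "trade"),
                             Reference  = c("crude", "money-fx", "trade")))

accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)   # share of each true class that was recovered
pos_pred    <- diag(cm) / rowSums(cm)   # share of each predicted class that was correct

round(accuracy, 4)
round(sensitivity, 4)
round(pos_pred, 4)
```

The results agree with the Accuracy, Sensitivity, and Pos Pred Value rows reported by confusionMatrix for this model.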

Support Vector Machine (SVM)

# output the fit of SVMs using the weighted (tf-idf) TDM
svm.tfidf.linear # linear kernel

## Support Vector Machines with Linear Kernel 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9671054  0.9502858
## 
## Tuning parameter 'C' was held constant at a value of 1
##

svm.tfidf.radial # radial basis function kernel

## Support Vector Machines with Radial Basis Function Kernel 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.7651909  0.6400185
##   0.50  0.8713132  0.8043525
##   1.00  0.9464666  0.9190532
## 
## Tuning parameter 'sigma' was held constant at a value of 1.158257
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 1.158257 and C = 1.

plot(svm.tfidf.radial)

# output the fit of SVMs using the unweighted TDM
svm.linear

## Support Vector Machines with Linear Kernel 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9684738  0.9525237
## 
## Tuning parameter 'C' was held constant at a value of 1
##

svm.radial

## Support Vector Machines with Radial Basis Function Kernel 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.8488007  0.7726513
##   0.50  0.8924019  0.8383231
##   1.00  0.9294970  0.8940379
## 
## Tuning parameter 'sigma' was held constant at a value of 0.003569925
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were sigma = 0.003569925 and C = 1.

plot(svm.radial)

# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(svm.linear.predict, tdmTest$doc.class)  # linear kernel, unweighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      119        0     2
##   money-fx     1       84     2
##   trade        1        3    71
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9682          
##                  95% CI : (0.9405, 0.9854)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9513          
##  Mcnemar's Test P-Value : 0.6746          
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9835          0.9655       0.9467
## Specificity                0.9877          0.9847       0.9808
## Pos Pred Value             0.9835          0.9655       0.9467
## Neg Pred Value             0.9877          0.9847       0.9808
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4205          0.2968       0.2509
## Detection Prevalence       0.4276          0.3074       0.2650
## Balanced Accuracy          0.9856          0.9751       0.9637

confusionMatrix(svm.radial.predict, tdmTest$doc.class)  # radial kernel, unweighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      118        0     2
##   money-fx     2       87     0
##   trade        1        0    73
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9823          
##                  95% CI : (0.9593, 0.9942)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9729          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9752          1.0000       0.9733
## Specificity                0.9877          0.9898       0.9952
## Pos Pred Value             0.9833          0.9775       0.9865
## Neg Pred Value             0.9816          1.0000       0.9904
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4170          0.3074       0.2580
## Detection Prevalence       0.4240          0.3145       0.2615
## Balanced Accuracy          0.9814          0.9949       0.9843

confusionMatrix(svm.tfidf.linear.predict, weightedTDMtest$doc.class) # linear kernel, weighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      119        0     0
##   money-fx     0       80     1
##   trade        2        7    74
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9647         
##                  95% CI : (0.936, 0.9829)
##     No Information Rate : 0.4276         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.946          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9835          0.9195       0.9867
## Specificity                1.0000          0.9949       0.9567
## Pos Pred Value             1.0000          0.9877       0.8916
## Neg Pred Value             0.9878          0.9653       0.9950
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4205          0.2827       0.2615
## Detection Prevalence       0.4205          0.2862       0.2933
## Balanced Accuracy          0.9917          0.9572       0.9717

confusionMatrix(svm.tfidf.radial.predict, weightedTDMtest$doc.class) # radial kernel, weighted TDM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      120       10     0
##   money-fx     0       73     1
##   trade        1        4    74
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9435          
##                  95% CI : (0.9098, 0.9673)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9129          
##  Mcnemar's Test P-Value : 0.00509         
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9917          0.8391       0.9867
## Specificity                0.9383          0.9949       0.9760
## Pos Pred Value             0.9231          0.9865       0.9367
## Neg Pred Value             0.9935          0.9330       0.9951
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4240          0.2580       0.2615
## Detection Prevalence       0.4594          0.2615       0.2792
## Balanced Accuracy          0.9650          0.9170       0.9813

# print various info about parameters, etc. used in the model with highest accuracy
svm.radial$results # resampled accuracy and kappa for each tuning parameter value

##         sigma    C  Accuracy     Kappa AccuracySD    KappaSD
## 1 0.003569925 0.25 0.8488007 0.7726513 0.04676278 0.07002347
## 2 0.003569925 0.50 0.8924019 0.8383231 0.04262555 0.06379599
## 3 0.003569925 1.00 0.9294970 0.8940379 0.03171192 0.04760602

svm.radial$bestTune # final tuning parameter

##         sigma C
## 3 0.003569925 1

svm.radial$metric # metric used to select optimal model

## [1] "Accuracy"

svm.radial$times # a list of execution times

## $everything
##    user  system elapsed 
##  694.94   57.41  363.74 
## 
## $final
##    user  system elapsed 
##    6.60    0.33    3.88 
## 
## $prediction
## [1] NA NA NA

The SVM trained with the radial basis function kernel on the unweighted term-document matrix had the highest test accuracy (98.23%). The five documents it misclassified all belonged to the crude and trade classes; every money-fx document was classified correctly.
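
To make the headline figures easier to check, here is a minimal sketch that recomputes accuracy, Cohen's kappa and per-class sensitivity by hand from the radial-kernel, unweighted confusion matrix printed above. The counts are copied from that output; the variable names (`cm`, `cohen_kappa`, and so on) are my own and are not part of the original analysis.

```r
# Confusion matrix for the radial-kernel SVM on the unweighted TDM,
# copied from the confusionMatrix() output (rows = predictions,
# columns = reference labels).
classes <- c("crude", "money-fx", "trade")
cm <- matrix(c(118,  0,  2,
                 2, 87,  0,
                 1,  0, 73),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: share of documents that fall on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity (recall) per class: correct predictions / true class size
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
round(sensitivity, 4)
```

The rounded values should agree with the Accuracy, Kappa and Sensitivity rows that caret reports for this model.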

Decision Tree

# output the decision tree fit using the weighted (tf-idf) TDM
tree.tfidf

## CART 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy  Kappa    
##   0.03719912  0.892494  0.8388971
##   0.36323851  0.778399  0.6607596
##   0.47483589  0.482008  0.1954103
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.03719912.

plot(tree.tfidf) # accuracy vs. complexity parameter values

# output the decision tree fit using the unweighted TDM
tree

## CART 
## 
##  710 samples
## 9243 predictors
##    3 classes: 'crude', 'money-fx', 'trade' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.04376368  0.8759610  0.8134491
##   0.33916849  0.7529400  0.6206175
##   0.46170678  0.4882577  0.2053925
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.04376368.

plot(tree)  # accuracy vs. complexity parameter values

# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(tree.predict, tdmTest$doc.class)  # unweighted TDM tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      116        0     5
##   money-fx     5       78     5
##   trade        0        9    65
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9152          
##                  95% CI : (0.8764, 0.9449)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.87            
##  Mcnemar's Test P-Value : 0.01098         
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9587          0.8966       0.8667
## Specificity                0.9691          0.9490       0.9567
## Pos Pred Value             0.9587          0.8864       0.8784
## Neg Pred Value             0.9691          0.9538       0.9522
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4099          0.2756       0.2297
## Detection Prevalence       0.4276          0.3110       0.2615
## Balanced Accuracy          0.9639          0.9228       0.9117

confusionMatrix(tree.tfidf.predict, weightedTDMtest$doc.class) # weighted TDM tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction crude money-fx trade
##   crude      115        0     4
##   money-fx     6       84     6
##   trade        0        3    65
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9329          
##                  95% CI : (0.8971, 0.9591)
##     No Information Rate : 0.4276          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8971          
##  Mcnemar's Test P-Value : 0.01173         
## 
## Statistics by Class:
## 
##                      Class: crude Class: money-fx Class: trade
## Sensitivity                0.9504          0.9655       0.8667
## Specificity                0.9753          0.9388       0.9856
## Pos Pred Value             0.9664          0.8750       0.9559
## Neg Pred Value             0.9634          0.9840       0.9535
## Prevalence                 0.4276          0.3074       0.2650
## Detection Rate             0.4064          0.2968       0.2297
## Detection Prevalence       0.4205          0.3392       0.2403
## Balanced Accuracy          0.9629          0.9521       0.9261

# print info about parameters, etc. used in the model with highest accuracy
tree.tfidf$results # resampled accuracy and kappa for each tuning parameter value

##           cp Accuracy     Kappa AccuracySD    KappaSD
## 1 0.03719912 0.892494 0.8388971 0.03210086 0.04759729
## 2 0.36323851 0.778399 0.6607596 0.10880605 0.17153519
## 3 0.47483589 0.482008 0.1954103 0.14614528 0.22758974

tree.tfidf$bestTune # final tuning parameter

##           cp
## 1 0.03719912

tree.tfidf$metric # metric used to select optimal model

## [1] "Accuracy"

tree.tfidf$times # a list of execution times

## $everything
##    user  system elapsed 
##  777.58   25.75  781.58 
## 
## $final
##    user  system elapsed 
##    8.99    0.20    8.35 
## 
## $prediction
## [1] NA NA NA

The decision tree trained on the tf-idf weighted matrix had the higher test accuracy of the two trees (93.29%, versus 91.52% for the unweighted matrix).

Conclusion

With respect to accuracy, an SVM using a radial basis function kernel and trained on the unweighted term-document matrix was the best-performing model (98.23% on the test set).
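
For a side-by-side view, the sketch below collects the test-set accuracies from the SVM and decision tree confusion matrices shown above into one data frame and sorts them. The kNN results are left out here, and the data frame and its column names are just an illustrative layout I chose.

```r
# Test-set accuracies copied from the confusionMatrix() outputs above
results <- data.frame(
  model     = c("SVM (linear)", "SVM (radial)", "SVM (linear)",
                "SVM (radial)", "Decision tree", "Decision tree"),
  weighting = c("unweighted", "unweighted", "tf-idf",
                "tf-idf", "unweighted", "tf-idf"),
  accuracy  = c(0.9682, 0.9823, 0.9647, 0.9435, 0.9152, 0.9329),
  stringsAsFactors = FALSE
)

# Sort from most to least accurate
results <- results[order(results$accuracy, decreasing = TRUE), ]
print(results, row.names = FALSE)
```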

Future Work

Several things can be done going forward to strengthen the analysis, including:

Try more ML algorithms (e.g., a neural network or multinomial logistic regression)
Implement LSA (Latent Semantic Analysis)
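
As a rough idea of what the LSA step could look like, here is a minimal sketch using base R's `svd()` on a toy document-term matrix. The counts, the term names and the choice of `k = 2` are made up for illustration; none of this has been run on the Reuters data, and a real implementation would need to choose `k` carefully.

```r
# Toy document-term matrix (rows = documents, columns = terms).
# The counts are invented for illustration only.
dtm <- matrix(c(3, 2, 0, 0,
                2, 3, 1, 0,
                0, 0, 4, 2,
                0, 1, 3, 3),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4),
                              c("oil", "barrel", "tariff", "export")))

k <- 2               # number of latent dimensions to keep (a tuning choice)
decomp <- svd(dtm)   # dtm = U %*% diag(d) %*% t(V)

# Document coordinates in the k-dimensional latent space: U_k %*% D_k
doc_lsa <- decomp$u[, 1:k] %*% diag(decomp$d[1:k])
rownames(doc_lsa) <- rownames(dtm)

# A new document is projected with the same term loadings: x %*% V_k
new_doc <- c(oil = 1, barrel = 1, tariff = 0, export = 0)
new_lsa <- new_doc %*% decomp$v[, 1:k]

doc_lsa
new_lsa
```

The reduced matrix `doc_lsa` would then replace the full term columns as the predictors passed to the classifiers, which should also cut memory use considerably.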

Important Question

Do these models need to be retrained every time a new document comes in? The concern is that new documents will contain words that aren't in the corpus used to train the model. Right now, my guess is that these new words can simply be dropped before prediction: because the training corpus was large, there should be very few of them, if any.
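
One way to act on that guess is to map every new document onto the training vocabulary before predicting: unseen words are discarded and vocabulary terms that don't occur get a count of zero. The sketch below does this in base R with a hypothetical six-term vocabulary and a deliberately simple tokeniser; in the real pipeline the vocabulary would be the term columns of the training matrix and the text would go through the same tm preprocessing as the training data. I believe tm's `dictionary` control option for `DocumentTermMatrix()` serves the same purpose, but I haven't tested it here.

```r
# Hypothetical training vocabulary (stand-in for the training matrix's term columns)
train_terms <- c("barrel", "crude", "dollar", "oil", "tariff", "trade")

# Simple tokeniser for illustration: lower-case, split on non-letters
new_text <- "Crude oil prices rose as OPEC cut oil output"
tokens <- strsplit(tolower(new_text), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]

# Count only terms in the training vocabulary. Words outside it
# ("prices", "opec", ...) become NA and are dropped by table();
# vocabulary terms missing from the document get a count of zero.
counts <- as.integer(table(factor(tokens, levels = train_terms)))

# One-row data frame with the training terms as column names
new_row <- setNames(as.data.frame(t(counts)), train_terms)
new_row
```

A row built this way has the training terms as its columns, so it can be passed to `predict()` without refitting, provided it also carries any non-term columns the fitted model expects. If tf-idf weights are used, the idf values from the training corpus would need to be reused as well.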

Bryan Cole

August 14, 2016
