The data used in this text mining application is the Reuters-21578 R8 dataset (all terms). It is obtainable from here. Reuters, Ltd. is an international news agency headquartered in London and is a division of Thomson Reuters.
The Reuters-21578 collection is one of the most widely used test collections for text categorization research. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system.
library(tm) # text mining
library(caret) # for machine learning
Note that the data were copied from the source website and pasted into text files (r8-train-all-terms.txt and r8-test-all-terms.txt), which were then saved to my local directory. The data are loaded into R data frames where rows represent documents. The “Class” column contains the label for the document’s topic (for supervised learning purposes) and the “docText” column contains the raw text of the document. The train dataset contains 5485 documents and the test dataset contains 2189 documents; the train/test split has already been done for us. The train and test data are then merged so that they form one corpus and receive exactly the same preprocessing, which is necessary for supervised learning. A variable “train_test” is created to denote whether a particular document belongs to the train or test dataset, so that the data can be split again later on.
This particular dataset comes conveniently packaged as tab-delimited .txt files, one per split, with one document per line. Normally, one is forced to deal with individual .txt documents.
The data contains documents whose classes belong to 8 of the 10 most frequent document classes of Reuters’ articles. For the sake of computational expense/memory, the data is subset to documents belonging to only 3 of these classes.
# load the data
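r8train <- read.table("r8-train-all-terms.txt", header=FALSE, sep='\t', stringsAsFactors=TRUE) # keep factors so droplevels() works below (R >= 4.0 defaults to FALSE)
r8test <- read.table("r8-test-all-terms.txt", header=FALSE, sep='\t', stringsAsFactors=TRUE)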
# explore the structure of the data
str(r8train)
## 'data.frame': 5485 obs. of 2 variables:
## $ V1: Factor w/ 8 levels "acq","crude",..: 3 1 3 3 3 3 3 3 3 3 ...
## $ V2: Factor w/ 5423 levels "a and p gap sets higher capital spending the great atlantic and pacific tea co said its three year mln dlr capital program will"| __truncated__,..: 964 1184 1105 143 739 1441 737 1744 5078 3649 ...
str(r8test)
## 'data.frame': 2189 obs. of 2 variables:
## $ V1: Factor w/ 8 levels "acq","crude",..: 8 4 7 1 3 3 1 3 3 3 ...
## $ V2: Factor w/ 2176 levels "a g edwards inc age st qtr may net shr cts vs cts net vs revs mln vs mln avg shrs vs reuter ",..: 124 381 144 1835 64 251 501 250 396 2080 ...
# rename variables
names(r8train) <- c("Class", "docText")
names(r8test) <- c("Class", "docText")
# convert the document text variable to character type
r8train$docText <- as.character(r8train$docText)
r8test$docText <- as.character(r8test$docText)
# create variable to denote if observation is train or test
r8train$train_test <- c("train")
r8test$train_test <- c("test")
# merge the train/test data
merged <- rbind(r8train, r8test)
# remove objects that are no longer needed
remove(r8train, r8test)
# subset to 3 document classes only for sake of computational expense/memory
# not doing so will result in a stack overflow or long computational times for ML algorithms
merged <- merged[which(merged$Class %in% c("crude","money-fx","trade")),]
# drop unused levels in the response variable
merged$Class <- droplevels(merged$Class)
# counts of each class in the train/test sets
table(merged$Class,merged$train_test)
##
## test train
## crude 121 253
## money-fx 87 206
## trade 75 251
The quality of the tagged dataset is by far the most important component of a text classifier. The dataset needs to be large enough to have an adequate number of documents in each class. The amount of data you need varies depending on the application and objectives. Here, there is still a sufficient amount of data after subsetting the overall data down to just the crude, money-fx, and trade documents.
In general, the dataset also needs to be of a high enough quality in terms of how distinct the documents in the different categories are from each other to allow clear delineation between the categories.
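I begin by creating a corpus (a large, structured collection of documents containing text) of all train/test documents. Raw data is rarely ready for analysis. Several common text pre-processing tasks are performed: converting text to lowercase, removing digits, removing punctuation, stripping extra whitespace, and removing English stopwords.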
The tm_map function applies these cleaning tasks to the entire corpus. The other major pre-processing component is creating the term document matrix (TDM), which contains the frequency of terms in each document. Here, rows of the TDM represent documents and each column represents a unique word in the corpus (strictly speaking this orientation is a document-term matrix, which is why it is built with DocumentTermMatrix). So, the (i,j)th entry of the TDM contains the number of times “word j” appeared in “document i”. By representing the corpus this way, we open the door for machine learning techniques to be used. In particular, algorithms such as kNN, Naive Bayes, SVM, etc. require that data be represented in the form of a table.
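To make the matrix layout concrete, here is a minimal base R sketch that builds the same kind of count matrix by hand for a toy corpus. The three documents are made up for illustration, and this is not how tm builds its (sparse) matrix internally; it only shows what the rows, columns, and entries mean.

```r
# Toy corpus: three short, already cleaned "documents" (made up for illustration)
docs <- c(d1 = "oil prices rise",
          d2 = "oil output falls as oil demand falls",
          d3 = "trade deficit widens")

# Split each document into words on whitespace
tokens <- strsplit(docs, "\\s+")

# The vocabulary is the sorted set of unique words in the corpus
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every document (one column per document)
counts <- sapply(tokens, function(words) as.vector(table(factor(words, levels = vocab))))
rownames(counts) <- vocab

# Transpose so that rows = documents and columns = terms
dtm <- t(counts)

dim(dtm)            # 3 documents, 10 unique terms
dtm["d2", "oil"]    # "oil" appears 2 times in document d2
```

Most entries of such a matrix are zero, which is why the real matrix is stored in sparse form until it is converted to a data frame.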
It is common to produce a weighted version of the TDM by term frequency - inverse document frequency (tf-idf). This weighted version takes into account how often a term is used in the entire corpus as well as in a single document. The logic is that if a term is used frequently throughout the corpus, it is probably not as important when differentiating documents. Conversely, if a word appears rarely in the corpus, it may be an important differentiator even if it only occurs a few times in a document. In this analysis, both the unweighted and tf-idf weighted TDMs are computed. The performance of various supervised learning techniques using both versions of the TDM is compared later on.
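The weighting can be sketched in a few lines of base R. This assumes the variant I understand weightTfIdf to use by default, namely term frequency normalized by document length multiplied by log2(number of documents / number of documents containing the term); the counts below are made up.

```r
# Toy document-term counts (made-up numbers): rows = documents, columns = terms
counts <- matrix(c(2, 1, 0,
                   1, 0, 3,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), c("oil", "opec", "tariff")))

# Term frequency: each count divided by the total number of terms in its document
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term)
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# tf-idf: scale each term's column by its idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

In this toy example "oil" occurs in every document, so its idf is log2(1) = 0 and it receives zero weight everywhere, while "tariff", which is concentrated in one document, receives the largest weight.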
Frequent terms can also be found (those whose total count across the corpus is at least a specified number). Lastly, I convert the TDMs into data frames for modeling purposes, split them back into the original train/test sets, and append the document labels as the last column in these data frames (using the variable “doc.class”).
# a vector source interprets each element of the vector as a document
sourceData <- VectorSource(merged$docText)
# create the corpus
corpus <- Corpus(sourceData)
# example document before pre-processing
corpus[[20]]$content
## [1] "saudi arabia reiterates commitment to opec accord saudi arabian oil minister hisham nazer reiterated the kingdom s commitment to last december s opec accord to boost world oil prices and stabilize the market the official saudi press agency spa said asked by the agency about the recent fall in free market oil prices nazer said saudi arabia is fully adhering by the accord and it will never sell its oil at prices below the pronounced prices under any circumstance saudi arabia was a main architect of december pact under which opec agreed to cut its total oil output ceiling by pct and return to fixed prices of around dollars a barrel reuter "
# preprocess/clean the merged (train + test) corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeNumbers) # remove digits
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # strip extra whitespace
corpus <- tm_map(corpus, removeWords, stopwords('english')) # remove stopwords
# example document after pre-processing
corpus[[20]]$content
## [1] "saudi arabia reiterates commitment opec accord saudi arabian oil minister hisham nazer reiterated kingdom s commitment last december s opec accord boost world oil prices stabilize market official saudi press agency spa said asked agency recent fall free market oil prices nazer said saudi arabia fully adhering accord will never sell oil prices pronounced prices circumstance saudi arabia main architect december pact opec agreed cut total oil output ceiling pct return fixed prices around dollars barrel reuter "
# create term document matrix (tdm); DocumentTermMatrix puts documents in rows and terms in columns
tdm <- DocumentTermMatrix(corpus)
# inspecting the tdm
dim(tdm) # 993 documents, 9243 terms
## [1] 993 9243
colnames(tdm)[200:210] # sample of columns (words)
## [1] "african" "aftermath" "afternoon" "afterwards" "agcny"
## [6] "aged" "ageing" "agencies" "agency" "agenda"
## [11] "agent"
as.matrix(tdm)[10:20,200:210] # inspect a portion of the tdm
## Terms
## Docs african aftermath afternoon afterwards agcny aged ageing agencies
## 10 0 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0
## 15 0 0 0 0 0 0 0 0
## 16 0 0 0 0 0 0 0 0
## 17 1 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0
## Terms
## Docs agency agenda agent
## 10 0 0 0
## 11 0 0 0
## 12 0 0 0
## 13 2 0 0
## 14 0 0 0
## 15 0 0 0
## 16 0 0 0
## 17 0 0 0
## 18 0 0 0
## 19 1 0 0
## 20 2 0 0
# create tf-idf weighted version of term document matrix
weightedtdm <- weightTfIdf(tdm)
as.matrix(weightedtdm)[10:20,200:210] # inspect same portion of the weighted tdm
## Terms
## Docs african aftermath afternoon afterwards agcny aged ageing agencies
## 10 0.0000000 0 0 0 0 0 0 0
## 11 0.0000000 0 0 0 0 0 0 0
## 12 0.0000000 0 0 0 0 0 0 0
## 13 0.0000000 0 0 0 0 0 0 0
## 14 0.0000000 0 0 0 0 0 0 0
## 15 0.0000000 0 0 0 0 0 0 0
## 16 0.0000000 0 0 0 0 0 0 0
## 17 0.1276481 0 0 0 0 0 0 0
## 18 0.0000000 0 0 0 0 0 0 0
## 19 0.0000000 0 0 0 0 0 0 0
## 20 0.0000000 0 0 0 0 0 0 0
## Terms
## Docs agency agenda agent
## 10 0.00000000 0 0
## 11 0.00000000 0 0
## 12 0.00000000 0 0
## 13 0.04114098 0 0
## 14 0.00000000 0 0
## 15 0.00000000 0 0
## 16 0.00000000 0 0
## 17 0.00000000 0 0
## 18 0.00000000 0 0
## 19 0.06114790 0 0
## 20 0.11903458 0 0
# find frequent terms: terms that occur at least 250 times in total across the corpus
findFreqTerms(tdm, 250)
## [1] "agreement" "also" "bank" "barrel" "billion"
## [6] "bpd" "countries" "crude" "deficit" "dlrs"
## [11] "dollar" "economic" "exchange" "exports" "february"
## [16] "foreign" "government" "imports" "industry" "japan"
## [21] "japanese" "last" "market" "markets" "meeting"
## [26] "minister" "mln" "new" "official" "officials"
## [31] "oil" "one" "opec" "pct" "price"
## [36] "prices" "production" "reuter" "said" "states"
## [41] "stg" "today" "told" "trade" "two"
## [46] "united" "week" "west" "will" "world"
## [51] "year"
# convert tdm's into data frames (as.matrix gives the full dense matrix; inspect() is meant for printing)
tdm <- as.data.frame(as.matrix(tdm))
weightedtdm <- as.data.frame(as.matrix(weightedtdm))
# split back into train and test sets
tdmTrain <- tdm[which(merged$train_test == "train"),]
weightedTDMtrain <- weightedtdm[which(merged$train_test == "train"),]
tdmTest <- tdm[which(merged$train_test == "test"),]
weightedTDMtest <- weightedtdm[which(merged$train_test == "test"),]
# remove objects that are no longer needed to conserve memory
remove(tdm,weightedtdm)
# append document labels as last column
tdmTrain$doc.class <- merged$Class[which(merged$train_test == "train")]
tdmTest$doc.class <- merged$Class[which(merged$train_test == "test")]
weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")]
weightedTDMtest$doc.class <- merged$Class[which(merged$train_test == "test")]
Remember, these methods are only available when the training data is tagged/labeled. In practice, people most often have to tag their own data (or hire poor grad students) if accuracy is important.
Repeated 10-fold cross-validation is also used. That is, the training data is split into 10 subsets (folds). Then, each algorithm is run 10 times, each time holding out a different fold for validation and training on the remaining 9. This process is repeated 3 separate times and the average accuracy across all held-out folds is used to choose the tuning parameters, which gives a more stable estimate of out-of-sample performance and helps guard against overfitting. Notice that, before each call to train, I set the random number seed. That has the effect of using the same resampling data sets for all models, which is necessary for a fair comparison.
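The resampling scheme can be sketched by hand in base R. This is only an illustration of how the folds are formed, not caret's implementation: as far as I know caret also stratifies the folds by class, which this sketch does not do. The names fold_ids, held_out and fit_rows are mine.

```r
set.seed(100)        # fixed seed so the fold assignment is reproducible
n       <- 710       # number of training documents in this analysis
k       <- 10        # number of folds
repeats <- 3         # number of times the whole procedure is repeated

# For each repeat, shuffle the fold labels 1..k across the n documents
# (result: an n x repeats matrix of fold labels)
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))

# Repeat 1, fold 1: these rows are held out, the other nine folds are used for fitting
held_out <- which(fold_ids[, 1] == 1)
fit_rows <- which(fold_ids[, 1] != 1)

length(held_out)     # 71 documents held out
length(fit_rows)     # 639 documents used for fitting
k * repeats          # 30 fits per candidate value of a tuning parameter
```

Because the seed is set before the folds are drawn, rerunning the sketch yields the same assignment, which is the same reason the seed is set before every call to train.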
Currently, 3 models are developed and compared: k-nearest neighbors (kNN), support vector machines (SVM), and decision trees.
Naive Bayes was also considered but due to its poor performance (approximately 37% accuracy for the unweighted TDM and 49% for the weighted), it is not discussed here.
kNN predicts the class of a new document by finding the k most similar documents (neighbors) in the training data and assigning the class that is most common among those k neighbors. Alternatively, the classes of the neighbors are weighted by the similarity of each neighbor to the new document, where similarity is measured by Euclidean distance or the cosine similarity between two document vectors.
The assumption of kNN is that the classification of a new document is most similar to the classification of other documents that are “nearby” in the vector space.
kNN doesn’t rely on prior probabilities and has essentially no training cost. The main computation happens at prediction time, when distances from the new document to the training documents are computed and sorted in order to find the k nearest neighbors. An alternative to kNN is hierarchical clustering if there’s believed to be a hierarchical structure to the data’s categories/labels.
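A bare-bones version of the unweighted majority-vote rule looks like this in base R. The function knn_predict and the toy data are hypothetical and only illustrate the idea; the models below are fit through caret instead.

```r
# Toy training set: counts of two terms per document (made-up numbers) and known classes
train_x <- rbind(c(3, 0), c(2, 1), c(0, 3), c(1, 4), c(0, 2))
train_y <- c("crude", "crude", "trade", "trade", "trade")

knn_predict <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from the new document to every training document
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Classes of the k closest training documents
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority vote among the neighbours (ties go to the alphabetically first class)
  names(which.max(table(nearest)))
}

knn_predict(c(2, 0), train_x, train_y, k = 3)   # two of the three nearest are "crude"
```

Note that all of the work happens inside the prediction call: nothing is estimated from the training data ahead of time.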
# set resampling scheme
ctrl <- trainControl(method="repeatedcv",number = 10, repeats = 3) #,classProbs=TRUE)
# fit a kNN model using the weighted (tf-idf) term document matrix
# tuning parameter: K
set.seed(100)
knn.tfidf <- train(doc.class ~ ., data = weightedTDMtrain, method = "knn", trControl = ctrl) #, tuneLength = 20)
# predict on test data
knn.tfidf.predict <- predict(knn.tfidf, newdata = weightedTDMtest)
##################################################################
# ---------------------------------------------------------------
##################################################################
# fit a kNN model using the unweighted TDM
# tuning parameter: K
set.seed(100)
knn <- train(doc.class ~ ., data = tdmTrain, method = "knn", trControl = ctrl) #, tuneLength = 20)
# predict on test data
knn.predict <- predict(knn, newdata = tdmTest)
From Wikipedia - “Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.” The main task of an SVM is finding the optimal hyperplane with maximum distance from the nearest training patterns which are the support vectors. Only the support vectors are used for prediction/categorization which makes it computationally efficient.
cons of SVM:
pros of SVM:
important: An SVM is only directly applicable to two-class tasks. If the variable of interest has more than two classes, a multiclass implementation must be used. In this analysis, there are more than two classes to classify and thus I am implementing a multiclass SVM. The dominant approach for doing so is to reduce the single multiclass problem into multiple binary classification problems. Common methods for such reduction include one-versus-rest (each class against all of the others) and one-versus-one (every pair of classes).
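One such reduction, usually called one-versus-one, trains a binary classifier for every pair of classes and lets them vote. The base R sketch below only shows the bookkeeping; the pairwise winners in pair_votes are hypothetical stand-ins for the outputs of three fitted binary SVMs. As I understand it, the kernlab backend that caret uses for svmLinear and svmRadial takes this pairwise approach, but treat that as an assumption rather than something verified here.

```r
classes <- c("crude", "money-fx", "trade")

# One binary problem for every unordered pair of classes
pairs <- t(combn(classes, 2))
pairs
nrow(pairs)      # 3 classes give choose(3, 2) = 3 binary classifiers

# Hypothetical winners of the three pairwise contests for one new document,
# in the same order as the rows of `pairs`
pair_votes <- c("crude", "trade", "trade")

# The predicted class is the one that wins the most pairwise contests
predicted <- names(which.max(table(pair_votes)))
predicted
```

With only three classes the two reductions need the same number of classifiers (three), but the pairwise count grows quadratically as classes are added.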
Here, I fit SVMs with the caret package and compare the performance of linear and radial basis function kernels.
# set resampling scheme: 10-fold cross-validation, 3 times
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 3) #,classProbs=TRUE)
# fit a multiclass SVM using the weighted (tf-idf) term document matrix
# kernel: linear
# tuning parameters: C
set.seed(100)
svm.tfidf.linear <- train(doc.class ~ . , data=weightedTDMtrain, trControl = ctrl, method = "svmLinear")
# fit another multiclass SVM using the weighted (tf-idf) term document matrix
# kernel: radial basis function
# tuning parameters: sigma, C
set.seed(100)
svm.tfidf.radial <- train(doc.class ~ . , data=weightedTDMtrain, trControl = ctrl, method = "svmRadial")
# predict on test data
svm.tfidf.linear.predict <- predict(svm.tfidf.linear,newdata = weightedTDMtest)
svm.tfidf.radial.predict <- predict(svm.tfidf.radial,newdata = weightedTDMtest)
##################################################################
# ---------------------------------------------------------------
##################################################################
# fit a multiclass SVM using the unweighted TDM
# kernel: linear
# tuning parameters: C
set.seed(100)
svm.linear <- train(doc.class ~ . , data=tdmTrain, trControl = ctrl, method = "svmLinear")
# fit another multiclass SVM using the unweighted TDM
# kernel: radial basis function
# tuning parameters: sigma, C
set.seed(100)
svm.radial <- train(doc.class ~ . , data=tdmTrain, trControl = ctrl, method = "svmRadial")
# predict on test data
svm.linear.predict <- predict(svm.linear,newdata = tdmTest)
svm.radial.predict <- predict(svm.radial,newdata = tdmTest)
From Wikipedia - “A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.”
Decision trees work by iteratively finding the variable (in a group of variables) that best splits the outcome into two groups. The recursion stops when all groups are sufficiently small/homogeneous/pure. This produces a non-linear model and the classification tree uses interactions between variables. Usually, the ultimate group/leaf will depend on many variables. Thus, decision trees can be viewed as predictive models which map observations about an item to conclusions about the item’s target value.
Different algorithms use different metrics for measuring the “best” split. Generally, they measure homogeneity of the target variable within the subsets. These metrics are applied to each candidate subset, and the resulting values are combined (e.g. by averaging) to provide a measure of the quality of the split. Some of the metrics are Gini impurity and information gain (entropy).
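One commonly used metric of this kind is Gini impurity, which is also the default splitting criterion for classification in rpart. The sketch below computes it for a toy node and for one candidate split; the helper names gini and split_impurity and the data are made up for illustration.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Quality of a binary split: size-weighted average impurity of the two child nodes
split_impurity <- function(labels, goes_left) {
  n     <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Toy node (made-up data): split on whether the document contains the word "oil"
labels  <- c("crude", "crude", "crude", "trade", "trade", "money-fx")
has_oil <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

gini(labels)                      # impurity before splitting
split_impurity(labels, has_oil)   # lower impurity after splitting on "oil"
```

A tree-growing algorithm would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.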
pros of decision trees:
cons of decision trees:
note: the data frames are wrapped in data.frame() to ensure syntactically valid variable names. rpart requires this, and errors will occur if it is not done.
# set resampling scheme: 10-fold cross-validation, 3 times
ctrl <- trainControl(method="repeatedcv", number = 10, repeats = 3) #,classProbs=TRUE)
# fit a decision tree using the weighted (tf-idf) term document matrix
# tuning parameter: cp
set.seed(100)
tree.tfidf <- train(doc.class ~ . , data = data.frame(weightedTDMtrain), method = "rpart",trControl = ctrl )
# predict on test data
tree.tfidf.predict <- predict(tree.tfidf,newdata = data.frame(weightedTDMtest))
##################################################################
# ---------------------------------------------------------------
##################################################################
# fit a decision tree using the unweighted TDM
# tuning parameter: cp
set.seed(100)
tree <- train(doc.class ~ . , data = data.frame(tdmTrain), method = "rpart", trControl = ctrl)
# predict on test data
tree.predict <- predict(tree,newdata = data.frame(tdmTest))
# --------------- Weighted (tf-idf) TDM ---------------------#
# Output of kNN fit
knn.tfidf
## k-Nearest Neighbors
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.8554163 0.7821269
## 7 0.8488569 0.7718249
## 9 0.8252937 0.7359511
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
# plot accuracy vs. number of neighbors
plot(knn.tfidf)
# --------------- Unweighted TDM ---------------------#
# output the kNN fit
knn
## k-Nearest Neighbors
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9182156 0.8771093
## 7 0.9177392 0.8764557
## 9 0.9196563 0.8794357
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
# plot accuracy vs. number of neighbors
plot(knn)
# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(knn.predict, tdmTest$doc.class) # unweighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 113 0 0
## money-fx 7 83 5
## trade 1 4 70
##
## Overall Statistics
##
## Accuracy : 0.9399
## 95% CI : (0.9056, 0.9646)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9084
## Mcnemar's Test P-Value : 0.04377
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9339 0.9540 0.9333
## Specificity 1.0000 0.9388 0.9760
## Pos Pred Value 1.0000 0.8737 0.9333
## Neg Pred Value 0.9529 0.9787 0.9760
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.3993 0.2933 0.2473
## Detection Prevalence 0.3993 0.3357 0.2650
## Balanced Accuracy 0.9669 0.9464 0.9546
confusionMatrix(knn.tfidf.predict, weightedTDMtest$doc.class) # weighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 103 0 2
## money-fx 1 78 2
## trade 17 9 71
##
## Overall Statistics
##
## Accuracy : 0.8905
## 95% CI : (0.8481, 0.9243)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8347
## Mcnemar's Test P-Value : 0.0006141
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.8512 0.8966 0.9467
## Specificity 0.9877 0.9847 0.8750
## Pos Pred Value 0.9810 0.9630 0.7320
## Neg Pred Value 0.8989 0.9554 0.9785
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.3640 0.2756 0.2509
## Detection Prevalence 0.3710 0.2862 0.3428
## Balanced Accuracy 0.9194 0.9406 0.9108
# print info about parameters, etc. used in the model with highest accuracy
knn$results # accuracy and kappa for each value of the tuning parameter
## k Accuracy Kappa AccuracySD KappaSD
## 1 5 0.9182156 0.8771093 0.03277920 0.04908560
## 2 7 0.9177392 0.8764557 0.03702300 0.05547952
## 3 9 0.9196563 0.8794357 0.03783284 0.05656316
knn$bestTune # final tuning parameter
## k
## 3 9
knn$metric # metric used to select optimal model
## [1] "Accuracy"
knn$times # a list of execution times
## $everything
## user system elapsed
## 453.83 1.83 455.88
##
## $final
## user system elapsed
## 0.12 0.00 0.13
##
## $prediction
## [1] NA NA NA
The model that does NOT use the weighted term document matrix (that is, the one that uses just raw frequency counts in the term document matrix) had higher accuracy on the test set (93.99% vs. 89.05%).
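The headline numbers that confusionMatrix reports can be recomputed by hand from the table itself. The sketch below copies the test-set confusion matrix of the unweighted kNN model from the output above and derives accuracy, per-class sensitivity, and kappa from it; the variable names are mine.

```r
# Confusion matrix for the unweighted kNN model, copied from the output above
# rows = predicted class, columns = reference (true) class
cm <- matrix(c(113,  0,  0,
                 7, 83,  5,
                 1,  4, 70),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("crude", "money-fx", "trade"),
                             Reference  = c("crude", "money-fx", "trade")))

n <- sum(cm)     # 283 test documents

# Accuracy: share of documents on the diagonal
accuracy <- sum(diag(cm)) / n

# Sensitivity per class: correct predictions / true documents of that class
sensitivity <- diag(cm) / colSums(cm)

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)   # 0.9399 and 0.9084
round(sensitivity, 4)                                 # 0.9339, 0.9540, 0.9333
```

These values agree with the Accuracy, Kappa, and Sensitivity lines in the printed output.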
# output the fit of SVM's using the weighted (tf-idf) TDM
svm.tfidf.linear # linear kernel
## Support Vector Machines with Linear Kernel
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9671054 0.9502858
##
## Tuning parameter 'C' was held constant at a value of 1
##
svm.tfidf.radial # radial basis function kernel
## Support Vector Machines with Radial Basis Function Kernel
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.7651909 0.6400185
## 0.50 0.8713132 0.8043525
## 1.00 0.9464666 0.9190532
##
## Tuning parameter 'sigma' was held constant at a value of 1.158257
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1.158257 and C = 1.
plot(svm.tfidf.radial)
# output the fit of SVM's using the unweighted TDM
svm.linear
## Support Vector Machines with Linear Kernel
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9684738 0.9525237
##
## Tuning parameter 'C' was held constant at a value of 1
##
svm.radial
## Support Vector Machines with Radial Basis Function Kernel
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8488007 0.7726513
## 0.50 0.8924019 0.8383231
## 1.00 0.9294970 0.8940379
##
## Tuning parameter 'sigma' was held constant at a value of 0.003569925
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.003569925 and C = 1.
plot(svm.radial)
# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(svm.linear.predict, tdmTest$doc.class) # linear kernel, unweighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 119 0 2
## money-fx 1 84 2
## trade 1 3 71
##
## Overall Statistics
##
## Accuracy : 0.9682
## 95% CI : (0.9405, 0.9854)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9513
## Mcnemar's Test P-Value : 0.6746
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9835 0.9655 0.9467
## Specificity 0.9877 0.9847 0.9808
## Pos Pred Value 0.9835 0.9655 0.9467
## Neg Pred Value 0.9877 0.9847 0.9808
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4205 0.2968 0.2509
## Detection Prevalence 0.4276 0.3074 0.2650
## Balanced Accuracy 0.9856 0.9751 0.9637
confusionMatrix(svm.radial.predict, tdmTest$doc.class) # radial kernel, unweighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 118 0 2
## money-fx 2 87 0
## trade 1 0 73
##
## Overall Statistics
##
## Accuracy : 0.9823
## 95% CI : (0.9593, 0.9942)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9729
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9752 1.0000 0.9733
## Specificity 0.9877 0.9898 0.9952
## Pos Pred Value 0.9833 0.9775 0.9865
## Neg Pred Value 0.9816 1.0000 0.9904
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4170 0.3074 0.2580
## Detection Prevalence 0.4240 0.3145 0.2615
## Balanced Accuracy 0.9814 0.9949 0.9843
confusionMatrix(svm.tfidf.linear.predict, weightedTDMtest$doc.class) # linear kernel, weighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 119 0 0
## money-fx 0 80 1
## trade 2 7 74
##
## Overall Statistics
##
## Accuracy : 0.9647
## 95% CI : (0.936, 0.9829)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.946
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9835 0.9195 0.9867
## Specificity 1.0000 0.9949 0.9567
## Pos Pred Value 1.0000 0.9877 0.8916
## Neg Pred Value 0.9878 0.9653 0.9950
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4205 0.2827 0.2615
## Detection Prevalence 0.4205 0.2862 0.2933
## Balanced Accuracy 0.9917 0.9572 0.9717
confusionMatrix(svm.tfidf.radial.predict, weightedTDMtest$doc.class) # radial kernel, weighted TDM
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 120 10 0
## money-fx 0 73 1
## trade 1 4 74
##
## Overall Statistics
##
## Accuracy : 0.9435
## 95% CI : (0.9098, 0.9673)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9129
## Mcnemar's Test P-Value : 0.00509
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9917 0.8391 0.9867
## Specificity 0.9383 0.9949 0.9760
## Pos Pred Value 0.9231 0.9865 0.9367
## Neg Pred Value 0.9935 0.9330 0.9951
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4240 0.2580 0.2615
## Detection Prevalence 0.4594 0.2615 0.2792
## Balanced Accuracy 0.9650 0.9170 0.9813
# print various info about parameters, etc. used in the model with highest accuracy
svm.radial$results # accuracy and kappa for each value of the tuning parameters
## sigma C Accuracy Kappa AccuracySD KappaSD
## 1 0.003569925 0.25 0.8488007 0.7726513 0.04676278 0.07002347
## 2 0.003569925 0.50 0.8924019 0.8383231 0.04262555 0.06379599
## 3 0.003569925 1.00 0.9294970 0.8940379 0.03171192 0.04760602
svm.radial$bestTune # final tuning parameter
## sigma C
## 3 0.003569925 1
svm.radial$metric # metric used to select optimal model
## [1] "Accuracy"
svm.radial$times # a list of execution times
## $everything
## user system elapsed
## 694.94 57.41 363.74
##
## $final
## user system elapsed
## 6.60 0.33 3.88
##
## $prediction
## [1] NA NA NA
The SVM trained with the radial basis function kernel on the unweighted term document matrix had the highest test accuracy (98.23%). The five documents it misclassified all belonged to the crude and trade classes; every money-fx document was classified correctly.
# output the decision tree fit using the weighted (tf-idf) TDM
tree.tfidf
## CART
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03719912 0.892494 0.8388971
## 0.36323851 0.778399 0.6607596
## 0.47483589 0.482008 0.1954103
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03719912.
plot(tree.tfidf) # accuracy vs. complexity parameter values
# output the decision tree fit using the unweighted TDM
tree
## CART
##
## 710 samples
## 9243 predictors
## 3 classes: 'crude', 'money-fx', 'trade'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 640, 639, 638, 637, 638, 639, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.04376368 0.8759610 0.8134491
## 0.33916849 0.7529400 0.6206175
## 0.46170678 0.4882577 0.2053925
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04376368.
plot(tree) # accuracy vs. complexity parameter values
# confusion matrices allow you to evaluate accuracy and other metrics
confusionMatrix(tree.predict , tdmTest$doc.class) # unweighted TDM tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 116 0 5
## money-fx 5 78 5
## trade 0 9 65
##
## Overall Statistics
##
## Accuracy : 0.9152
## 95% CI : (0.8764, 0.9449)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.87
## Mcnemar's Test P-Value : 0.01098
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9587 0.8966 0.8667
## Specificity 0.9691 0.9490 0.9567
## Pos Pred Value 0.9587 0.8864 0.8784
## Neg Pred Value 0.9691 0.9538 0.9522
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4099 0.2756 0.2297
## Detection Prevalence 0.4276 0.3110 0.2615
## Balanced Accuracy 0.9639 0.9228 0.9117
confusionMatrix(tree.tfidf.predict, weightedTDMtest$doc.class) # weighted TDM tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction crude money-fx trade
## crude 115 0 4
## money-fx 6 84 6
## trade 0 3 65
##
## Overall Statistics
##
## Accuracy : 0.9329
## 95% CI : (0.8971, 0.9591)
## No Information Rate : 0.4276
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8971
## Mcnemar's Test P-Value : 0.01173
##
## Statistics by Class:
##
## Class: crude Class: money-fx Class: trade
## Sensitivity 0.9504 0.9655 0.8667
## Specificity 0.9753 0.9388 0.9856
## Pos Pred Value 0.9664 0.8750 0.9559
## Neg Pred Value 0.9634 0.9840 0.9535
## Prevalence 0.4276 0.3074 0.2650
## Detection Rate 0.4064 0.2968 0.2297
## Detection Prevalence 0.4205 0.3392 0.2403
## Balanced Accuracy 0.9629 0.9521 0.9261
# print info about parameters, etc. used in the model with highest accuracy
tree.tfidf$results # accuracy and kappa for each value of the tuning parameter
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.03719912 0.892494 0.8388971 0.03210086 0.04759729
## 2 0.36323851 0.778399 0.6607596 0.10880605 0.17153519
## 3 0.47483589 0.482008 0.1954103 0.14614528 0.22758974
tree.tfidf$bestTune # final tuning parameter
## cp
## 1 0.03719912
tree.tfidf$metric # metric used to select optimal model
## [1] "Accuracy"
tree.tfidf$times # a list of execution times
## $everything
## user system elapsed
## 777.58 25.75 781.58
##
## $final
## user system elapsed
## 8.99 0.20 8.35
##
## $prediction
## [1] NA NA NA
The decision tree using the tf-idf weighted matrix had a higher test accuracy (93.29%) than the tree using the unweighted matrix (91.52%).
With respect to accuracy, an SVM using a radial basis function kernel and the unweighted term document matrix for training was the best performing model.
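For a side-by-side view, the test accuracies printed by confusionMatrix above can be collected into one small table and sorted. The numbers are copied from the output in this post; the data frame and its column names are just a convenience I am introducing here.

```r
# Test-set accuracies as reported by confusionMatrix() in the output above
results <- data.frame(
  model    = c("kNN", "kNN", "SVM linear", "SVM linear",
               "SVM radial", "SVM radial", "tree", "tree"),
  tdm      = c("unweighted", "tf-idf", "unweighted", "tf-idf",
               "unweighted", "tf-idf", "unweighted", "tf-idf"),
  accuracy = c(0.9399, 0.8905, 0.9682, 0.9647,
               0.9823, 0.9435, 0.9152, 0.9329)
)

# Sort from most to least accurate
ranked <- results[order(-results$accuracy), ]
ranked
```

Keep in mind that the test set has only 283 documents, so the 95% confidence intervals printed above overlap for several of these models.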
Several things can be done going forward to strengthen the analysis, including:
Do these models need to be retrained every single time a new document comes in? The concern is that new documents will have words that aren’t in the corpus that was used to train the model. Right now, my guess is that you could just drop any of these new words, as there should be very few, if any, since our corpus was large.
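Dropping unseen words amounts to counting a new document's terms against the training vocabulary only. Here is a minimal base R sketch of that idea; train_vocab and new_doc are made up, and in practice the vocabulary would be the column names of the training data frame.

```r
# Vocabulary learned from the training corpus (a made-up subset for illustration)
train_vocab <- c("barrel", "dollar", "oil", "opec", "trade")

# A new, already cleaned document containing words the model has never seen
new_doc <- "opec ministers discuss oil quotas as oil futures slip"

# Tokenize, then count only terms in the training vocabulary:
# factor() with fixed levels turns unseen words into NA, which table() ignores
tokens <- strsplit(new_doc, "\\s+")[[1]]
counts <- as.vector(table(factor(tokens, levels = train_vocab)))
names(counts) <- train_vocab

# One-row data frame with exactly the columns the model was trained on
new_x <- as.data.frame(t(counts))
new_x
```

With tm itself, the same effect should be achievable by passing the training terms through the dictionary entry of the control list when building the matrix for new documents. For the tf-idf models, the idf values from the training corpus would presumably need to be reused as well, rather than recomputed on the new documents.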