Load the data source and libraries.
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
sb <- read.csv("spambase.data", header=FALSE)
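As a quick sanity check (assuming the standard UCI spambase file), the data frame should have 4601 rows and 58 columns, with V58 holding the 0/1 class label:
# Verify dimensions and the class-label column of the loaded data
dim(sb)     # UCI spambase: 4601 rows, 57 features + 1 label column
str(sb$V58) # 0 = not spam, 1 = spam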
I took a random sample of 600 cases from spambase.data. The sample size was chosen empirically as a number my system could handle without excessive processing time while still producing meaningful performance statistics.
ds <- sb[sample(1:nrow(sb), 600,replace=FALSE),]
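Note that the sample is drawn without a fixed seed, so the exact cases differ on every run; a minimal tweak for reproducibility would be to set the seed first (the seed value below is arbitrary):
# Optional: fix the RNG seed so the same 600-case sample is drawn each time
set.seed(123)
ds <- sb[sample(1:nrow(sb), 600, replace=FALSE),]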
Separate the class labels, V58 (1 = spam, 0 = not spam), from the sample data. Leaving the labels in the feature set would bias the results, I believe.
labels <- ds[58]
tdata <- ds[,1:57]
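It's worth confirming that both classes are reasonably represented in the 600-case sample; the exact counts will vary with the draw:
# Spam / not-spam counts in the random sample
table(labels)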
From the sample, use the first 70% (cases 1–420) as training data and the remaining 30% (cases 421–600) as test data.
container <- create_container(tdata, t(labels), trainSize=1:420, testSize = 421:600, virgin=FALSE)
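The 420/600 boundary is hard-coded; if the sample size changes it has to be updated by hand. A small variation (not what I ran above) derives the split from the data instead:
# Derive the 70/30 train/test boundary from the sample size
n_train <- floor(0.7 * nrow(tdata))
container <- create_container(tdata, t(labels),
                              trainSize = 1:n_train,
                              testSize  = (n_train + 1):nrow(tdata),
                              virgin = FALSE)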
I decided to try every available algorithm except NNET, which required a special data configuration that wasn’t compatible with some of the other algorithms.
models <- train_models(container, algorithms=c("BAGGING", "BOOSTING", "GLMNET", "MAXENT", "RF", "SLDA", "SVM", "TREE"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1.00 0.94
## n >= 2 1.00 0.94
## n >= 3 1.00 0.94
## n >= 4 1.00 0.94
## n >= 5 0.97 0.96
## n >= 6 0.91 0.96
## n >= 7 0.80 0.97
## n >= 8 0.54 0.98
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.730 0.745 0.730
## SLDA_PRECISION SLDA_RECALL SLDA_FSCORE
## 0.785 0.755 0.765
## LOGITBOOST_PRECISION LOGITBOOST_RECALL LOGITBOOST_FSCORE
## 0.920 0.930 0.925
## BAGGING_PRECISION BAGGING_RECALL BAGGING_FSCORE
## 0.920 0.920 0.920
## FORESTS_PRECISION FORESTS_RECALL FORESTS_FSCORE
## 0.965 0.965 0.965
## GLMNET_PRECISION GLMNET_RECALL GLMNET_FSCORE
## 0.890 0.865 0.875
## TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0.875 0.880 0.875
## MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0.895 0.895 0.895
We see that SVM was the worst performer on this sample and Random Forest was the best.
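Beyond the printed summary, the analytics object exposes its pieces as slots, and RTextTools also offers n-fold cross-validation of individual algorithms; the calls below are a sketch of how one might dig further (slot and function names per the RTextTools documentation):
# Per-algorithm precision/recall/F-score as a data frame
analytics@algorithm_summary
# Per-document labels, probabilities, and ensemble agreement
head(analytics@document_summary)
# 4-fold cross-validation of a single algorithm, e.g. SVM
cross_validate(container, 4, algorithm = "SVM")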
K-Means Clustering
Let’s see what a k-means cluster analysis finds in the dataset. I used the entire spambase data with the class column V58 removed, then compared the cluster assignments to the actual classes in a table. First, note the ratio of spam to not-spam in spambase.data.
table(sb[,58])
##
## 0 1
## 2788 1813
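The same counts as proportions give the baseline to compare the clusters against: roughly 61% not-spam and 39% spam.
# Class proportions in the full spambase data set
prop.table(table(sb[,58]))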
library(stats)
kdata<-sb[,1:57]
model <- kmeans(x = kdata, centers = 3)
table(model$cluster, sb[,58])
##
## 0 1
## 1 122 322
## 2 14 43
## 3 2652 1448
The row labels on the left are cluster numbers generated by the algorithm; the columns show how many cases in each cluster are actually not-spam (0) versus spam (1). If the data were merely random, the counts would be spread more or less evenly across the cells. Here most cases fall into a single cluster (cluster 3), indicating similarity among them. Ideally, the 0/1 ratio within the clusters would be similar to the actual class ratio in the data.
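One caveat: k-means is sensitive to feature scale, and the spambase columns mix word/character frequencies with raw capital-run-length counts, so the unscaled clusters may be driven by the largest-valued columns. A sketch of a variant I did not run above: standardize the features, use two centers to match the two classes, and look at each cluster's class split as row proportions.
# Standardize features, cluster with k = 2, and show each cluster's
# spam / not-spam split as row-wise proportions
model2 <- kmeans(scale(kdata), centers = 2)
round(prop.table(table(model2$cluster, sb[,58]), margin = 1), 2)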