Load the data source and libraries.
library(RTextTools)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
##
## The following object is masked from 'package:base':
##
## backsolve
sb <- read.csv("spambase.data", header=FALSE)
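As a quick sanity check (assuming the standard UCI spambase file), the data frame should have 4601 rows and 58 columns, with V58 holding the 0/1 class label:
# Verify dimensions and the class-label column of the loaded data
dim(sb)     # UCI spambase: 4601 rows, 57 features + 1 label column
str(sb$V58) # 0 = not spam, 1 = spam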
I took a random sample of 600 cases from spambase.data. The sample size was chosen empirically as a number my system could handle without excessive processing time while still producing meaningful performance statistics.
ds <- sb[sample(1:nrow(sb), 600,replace=FALSE),]
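Note that the sample is drawn without a fixed seed, so the exact cases differ on every run; a minimal tweak for reproducibility would be to set the seed first (the seed value below is arbitrary):
# Optional: fix the RNG seed so the same 600-case sample is drawn each time
set.seed(123)
ds <- sb[sample(1:nrow(sb), 600, replace=FALSE),]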
Separate the class labels, V58 (1 = spam, 0 = not spam), from the sample data. Leaving the labels in the feature set would bias the results, I believe.
labels <- ds[58]
tdata <- ds[,1:57]
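It's worth confirming that both classes are reasonably represented in the 600-case sample; the exact counts will vary with the draw:
# Spam / not-spam counts in the random sample
table(labels)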
From the sample, use the first 70% (cases 1–420) as training data and the remaining 30% (cases 421–600) as test data.
container <- create_container(tdata, t(labels), trainSize=1:420, testSize = 421:600, virgin=FALSE)
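The 420/600 boundary is hard-coded; if the sample size changes it has to be updated by hand. A small variation (not what I ran above) derives the split from the data instead:
# Derive the 70/30 train/test boundary from the sample size
n_train <- floor(0.7 * nrow(tdata))
container <- create_container(tdata, t(labels),
                              trainSize = 1:n_train,
                              testSize  = (n_train + 1):nrow(tdata),
                              virgin = FALSE)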
I decided to try every available algorithm except NNET, which required a special data configuration that wasn’t compatible with some of the other algorithms.
models <- train_models(container, algorithms=c("BAGGING", "BOOSTING", "GLMNET", "MAXENT", "RF", "SLDA", "SVM", "TREE"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1.00 0.94
## n >= 2 1.00 0.94
## n >= 3 1.00 0.94
## n >= 4 1.00 0.94
## n >= 5 0.97 0.96
## n >= 6 0.91 0.96
## n >= 7 0.80 0.97
## n >= 8 0.54 0.98
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.730 0.745 0.730
## SLDA_PRECISION SLDA_RECALL SLDA_FSCORE
## 0.785 0.755 0.765
## LOGITBOOST_PRECISION LOGITBOOST_RECALL LOGITBOOST_FSCORE
## 0.920 0.930 0.925
## BAGGING_PRECISION BAGGING_RECALL BAGGING_FSCORE
## 0.920 0.920 0.920
## FORESTS_PRECISION FORESTS_RECALL FORESTS_FSCORE
## 0.965 0.965 0.965
## GLMNET_PRECISION GLMNET_RECALL GLMNET_FSCORE
## 0.890 0.865 0.875
## TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0.875 0.880 0.875
## MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0.895 0.895 0.895
We see that SVM was the worst performer on this sample and Random Forest was the best.
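Beyond the printed summary, the analytics object exposes its pieces as slots, and RTextTools also offers n-fold cross-validation of individual algorithms; the calls below are a sketch of how one might dig further (slot and function names per the RTextTools documentation):
# Per-algorithm precision/recall/F-score as a data frame
analytics@algorithm_summary
# Per-document labels, probabilities, and ensemble agreement
head(analytics@document_summary)
# 4-fold cross-validation of a single algorithm, e.g. SVM
cross_validate(container, 4, algorithm = "SVM")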
K-Means Clustering
Let’s see what a k-means cluster analysis finds in the dataset. I used the entire spambase data with the class column V58 removed, then compared the cluster assignments to the actual classes in a table. First, note the ratio of spam to not-spam in spambase.data.
table(sb[,58])
##
## 0 1
## 2788 1813
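The same counts as proportions give the baseline to compare the clusters against: roughly 61% not-spam and 39% spam.
# Class proportions in the full spambase data set
prop.table(table(sb[,58]))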
library(stats)
kdata<-sb[,1:57]
model <- kmeans(x = kdata, centers = 3)
table(model$cluster, sb[,58])
##
## 0 1
## 1 122 322
## 2 14 43
## 3 2652 1448
The row labels on the left are cluster numbers generated by the algorithm; the columns show how many cases in each cluster are actually not-spam (0) versus spam (1). If the data were merely random, the counts would be spread more or less evenly across the cells. Here most cases fall into a single cluster (cluster 3), indicating similarity among them. Ideally, the 0/1 ratio within the clusters would be similar to the actual class ratio in the data.
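One caveat: k-means is sensitive to feature scale, and the spambase columns mix word/character frequencies with raw capital-run-length counts, so the unscaled clusters may be driven by the largest-valued columns. A sketch of a variant I did not run above: standardize the features, use two centers to match the two classes, and look at each cluster's class split as row proportions.
# Standardize features, cluster with k = 2, and show each cluster's
# spam / not-spam split as row-wise proportions
model2 <- kmeans(scale(kdata), centers = 2)
round(prop.table(table(model2$cluster, sb[,58]), margin = 1), 2)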