I started by reading the Spambase data into a table and converting that table to a data frame with the dplyr package. John DeBase suggested the caret package, and following the notes at http://topepo.github.io/caret/index.html helped a lot with the code below.
Once the data frame is ready, naming the 58th column spam makes the data much easier to manipulate later on.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Warning: package 'caret' was built under R version 3.2.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.2
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.2.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(MASS)
## Warning: package 'MASS' was built under R version 3.2.2
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(nnet)
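# spambase.data has no header row, so read it with header=FALSE (columns become V1..V58)
spamdata<-read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data",header=FALSE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)
spamdata=tbl_df(spamdata)
# column 58 is the class label (1 = spam, 0 = not spam); store it as a factor
colnames(spamdata)[58]="spam"
spamdata$spam=as.factor(spamdata$spam)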
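The UCI file ships without a header row (the column names live in a separate spambase.names file), so it is worth checking how the header argument of read.table affects the row count. The sketch below uses a made-up three-row extract in the same comma-separated layout; raw_text and both object names are hypothetical and only illustrate the check.

```r
# Hypothetical three-row extract in the same comma-separated, header-less layout.
raw_text <- "0,0.64,1\n0.21,0.28,1\n0.06,0,0"

# header = TRUE treats the first record as column names and drops it from the data.
with_header <- read.table(text = raw_text, header = TRUE, sep = ",")

# header = FALSE keeps every record and names the columns V1, V2, ...
without_header <- read.table(text = raw_text, header = FALSE, sep = ",")

nrow(with_header)
nrow(without_header)
names(without_header)
```

On the real file, dim(spamdata) should report 4601 rows and 58 columns when no record has been consumed as a header.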
Now that the data frame is ready to analyze, the next step is to create training and testing data sets. Because processing the full data takes much longer, I used a smaller random sample of 1000 records. The sample is split 70:30 into training and testing sets using createDataPartition.
set.seed(1234)
spam_sample = spamdata[sample(nrow(spamdata), 1000), ]
training_index = createDataPartition(spam_sample$spam, p=.7, list=FALSE)
training_set = spam_sample[training_index, ]
testing_set = spam_sample[-training_index, ]
#spamdata <- rbind(training_set, testing_set)
#outcomes <- spam_sample$spam
#spamdata <- subset(spam_sample, select = -spam)
#matrix <- as.DocumentTermMatrix(spam_sample, weightTf)
#container <- create_container(matrix, t(outcomes), trainSize = 1:3500, testSize = 3501:4000, virgin=FALSE)
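createDataPartition samples within each class when the outcome is a factor, so the training and testing sets keep roughly the same spam share. The base-R sketch below mimics that idea on made-up labels. It is not caret's implementation, and labels, idx_by_class and train_idx are hypothetical names.

```r
set.seed(1234)

# Hypothetical labels with a 60:40 class mix, close to the prevalence reported for the testing set.
labels <- factor(rep(c(0, 1), times = c(600, 400)))

# Draw 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(i) sample(i, round(0.7 * length(i)))))

# Class proportions in the training part and in the held-out part.
train_prop <- prop.table(table(labels[train_idx]))
test_prop <- prop.table(table(labels[-train_idx]))
train_prop
test_prop
```

Both tables show the same proportions, which is the point of stratifying: a plain random split of a small sample could leave the two sets with noticeably different spam rates.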
The next steps are to create the training control and train the models, as shown below:
tr_ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
#train the models
forest_train = train(spam ~ .,
                     data = training_set,
                     method = "rf",
                     trControl = tr_ctrl)
lda_train = train(spam ~ .,
                  data = training_set,
                  method = "lda",
                  trControl = tr_ctrl)
## Warning in lda.default(x, grouping, ...): variables are collinear
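The "variables are collinear" warning means lda found that the predictors it was given do not all carry independent information. A minimal way to look for that, sketched here on made-up data, is to compare the numerical rank of the predictor matrix with its number of columns. The toy matrix and its column names are hypothetical; on the real data the same check would be run on the predictor columns of training_set.

```r
set.seed(1)

# Hypothetical toy predictors: x3 is an exact linear combination of x1 and x2.
x1 <- rnorm(50)
x2 <- rnorm(50)
toy <- cbind(x1 = x1, x2 = x2, x3 = x1 + 2 * x2)

# A rank below the number of columns means at least one column is redundant.
toy_rank <- qr(toy)$rank
toy_rank < ncol(toy)

# Pairwise correlations help spot near-duplicate columns in real data.
round(cor(toy), 2)
```

The caret package also provides findLinearCombos and nearZeroVar, which may be a more convenient way to screen the Spambase predictors before fitting lda.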
Next, predictions for the testing set are generated from each trained model with predict (dropping column 58, the spam label), and confusionMatrix from the caret package compares them with the true labels.
pred_forest = predict(forest_train, testing_set[,-58])
pred_lda = predict(lda_train, testing_set[,-58])
#pred_nnet = predict(nnet_train, testing_set[,-58])
#pred_qda=predict(qda_train,testing_set[,-58])
#pred_fda=predict(fda_train,testing_set[,-58])
#pred_mda=predict(mda_train,testing_set[,-58])
#pred_glm=predict(glm_train,testing_set[,-58])
confusionMatrix(pred_forest ,testing_set$spam)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 176  19
##          1   6  98
##
## Accuracy : 0.9164
## 95% CI : (0.879, 0.9452)
## No Information Rate : 0.6087
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8209
## Mcnemar's Test P-Value : 0.0164
##
## Sensitivity : 0.9670
## Specificity : 0.8376
## Pos Pred Value : 0.9026
## Neg Pred Value : 0.9423
## Prevalence : 0.6087
## Detection Rate : 0.5886
## Detection Prevalence : 0.6522
## Balanced Accuracy : 0.9023
##
## 'Positive' Class : 0
##
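The headline statistics can be recomputed by hand from the four counts in the random forest table above, which makes it clearer what each one measures. The sketch below copies those counts into a matrix; cm is a hypothetical name, and class 0 (not spam) is treated as the positive class, as in the output.

```r
# Counts copied from the random forest confusion matrix
# (rows = prediction, columns = reference, positive class = 0).
cm <- matrix(c(176, 6, 19, 98), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Share of all testing records that were classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true class 0 records predicted as 0.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Share of true class 1 records predicted as 1.
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Because the positive class is 0, the specificity of 0.8376 is the share of actual spam that was caught, which is the weaker side of this model.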
confusionMatrix(pred_lda,testing_set$spam)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   0   1
##          0 173  18
##          1   9  99
##
## Accuracy : 0.9097
## 95% CI : (0.8713, 0.9396)
## No Information Rate : 0.6087
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8078
## Mcnemar's Test P-Value : 0.1237
##
## Sensitivity : 0.9505
## Specificity : 0.8462
## Pos Pred Value : 0.9058
## Neg Pred Value : 0.9167
## Prevalence : 0.6087
## Detection Rate : 0.5786
## Detection Prevalence : 0.6388
## Balanced Accuracy : 0.8984
##
## 'Positive' Class : 0
##
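The McNemar line in both outputs asks whether the two kinds of error (the off-diagonal counts) are equally frequent. It can be reproduced with mcnemar.test from base R's stats package on the counts copied from the two tables; cm_rf and cm_lda are hypothetical names.

```r
# Counts copied from the two confusion matrices (rows = prediction, columns = reference).
cm_rf <- matrix(c(176, 6, 19, 98), nrow = 2)
cm_lda <- matrix(c(173, 9, 18, 99), nrow = 2)

# mcnemar.test uses only the off-diagonal cells: 19 vs 6 for rf, 18 vs 9 for lda.
p_rf <- mcnemar.test(cm_rf)$p.value
p_lda <- mcnemar.test(cm_lda)$p.value

round(c(rf = p_rf, lda = p_lda), 4)
```

For the random forest the small p-value suggests its errors are lopsided: it lets spam through (19 cases) more often than it flags legitimate mail (6 cases). For LDA the imbalance (18 vs 9) is not large enough to be distinguished from chance at this sample size.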
#confusionMatrix(pred_nnet,testing_set$spam)
#confusionMatrix(pred_qda,testing_set$spam)
#confusionMatrix(pred_fda,testing_set$spam)
#confusionMatrix(pred_mda,testing_set$spam)
#confusionMatrix(pred_glm,testing_set$spam)
Summary: I tried Random Forest and Linear Discriminant Analysis, and they reached about 91.6% and 91.0% accuracy on the testing set, respectively. The 95% confidence intervals overlap almost entirely, so either model is a reasonable starting point for further refinement. This is my first experience with these algorithms, and the biggest challenge I am facing is deciding which algorithm to choose. I believe we will become more conversant with this as we get deeper into Machine Learning.
I have left the remaining commented-out code in place so that I can come back to it and complete the rest of the analysis with different algorithms.
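To put the comparison in the summary into numbers, the sketch below counts the correct predictions for each model from the confusion matrices and computes an exact binomial interval with binom.test, which appears to be what the reported 95% CI corresponds to. n_test, correct and ci are hypothetical names.

```r
# Testing-set size and correct predictions, taken from the two confusion matrices.
n_test <- 299
correct <- c(rf = 176 + 98, lda = 173 + 99)

# Accuracy of each model on the testing set.
round(correct / n_test, 4)

# Exact binomial 95% interval for each accuracy (row 1 = lower, row 2 = upper).
ci <- sapply(correct, function(k) binom.test(k, n_test)$conf.int)
round(ci, 4)

# The models differ by only this many testing records.
diff_correct <- unname(correct["rf"] - correct["lda"])
diff_correct
```

A gap of two records out of 299 is well inside either interval, so this testing set alone does not separate the two models. caret's resamples function, applied to forest_train and lda_train, would be one way to compare them across the cross-validation resamples instead.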