CAP 5610 Assignment#2 Danilo Martinez

Machine Problem 2. Applying the Naïve Gaussian Bayes classifier on MNIST.
- Use two solutions (MLE/MAP) to estimate the prior distribution P(Y).
- Estimate the independent Gaussian density for each dimension of the feature vector.
- Report error rates with three test protocols; you need to tune alpha_d for d = 0, ..., 9.
  - Test/training
  - Test/validation/training
  - 5-fold/10-fold cross-validation
- Tune alpha_d = 1%, 2%, 4%, 8%, 16% of training examples.
- Compare your result with KNN. Which is better? Why?
  - KNN lost correlations between different features (pixels in MNIST case)?
- Do you observe overfitting? Do {alpha_d} help to reduce overfitting?

Bayes’ theorem is a mathematical formula for determining conditional probability, named after the 18th-century British mathematician Thomas Bayes. The theorem provides a way to revise existing predictions or theories given new or additional evidence. The formula is written as P(A|B) = P(B|A) * P(A) / P(B). P(A) and P(B) are the marginal probabilities of A and B, each taken without regard to the other. P(B|A) is the probability that B will occur given that A is true. Finally, the answer, P(A|B), is the conditional probability of A occurring given that B is true. Bayes’ theorem has many uses in the fields of finance, economics, and medicine.
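
As a small illustration of the formula, the sketch below applies Bayes' theorem to three hypothetical classes. The priors and likelihoods are made-up numbers chosen only for illustration; P(B) is obtained by summing P(B|A) * P(A) over the classes.

```r
# Toy example: posterior over three hypothetical classes via Bayes' theorem.
# All numbers are illustrative assumptions, not values from this report.
prior      <- c(a = 0.5, b = 0.3, c = 0.2)     # P(A) for each class
likelihood <- c(a = 0.10, b = 0.40, c = 0.70)  # P(B | A) for the observed evidence B

evidence  <- sum(likelihood * prior)           # P(B), by the law of total probability
posterior <- likelihood * prior / evidence     # P(A | B) for each class

print(round(posterior, 4))
print(names(which.max(posterior)))             # class with the largest posterior
```

A classifier built on this rule predicts the class with the largest posterior; note that the class with the largest prior need not win once the evidence is taken into account.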

Maximum likelihood estimation (MLE) is an analytic maximization procedure. The likelihood of a set of data is the probability of obtaining that particular set of data, given the chosen probability distribution model; MLE picks the parameter values that maximize it. MLE has very desirable large-sample properties: it becomes an unbiased minimum-variance estimator as the sample size increases, it has an approximately normal distribution and approximate sample variances that can be calculated and used to generate confidence bounds, and it can be used to test hypotheses about models and parameters. There are drawbacks with MLE. For example, with small numbers of failures (fewer than 5, and sometimes fewer than 10), MLE can be heavily biased and the large-sample optimality properties do not apply. With small samples, MLE may not be very precise and may even generate a line that lies above or below the data points. In Bayesian statistics, a maximum a posteriori (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution. The MAP estimate can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data.

The maximum likelihood estimate (MLE) of a parameter is the value of the parameter that maximizes the likelihood, where the likelihood is a function of the parameter and equals the probability of the data conditioned on that parameter. The maximum a posteriori (MAP) estimate is the value of the parameter that maximizes the posterior distribution (which is calculated from the likelihood and the prior); that is, it is the mode of the posterior distribution. There is no difference between the MLE and the MAP estimate if the assumed prior distribution is constant (uniform). The problem with MLE is that it can overfit the data: the variance of the parameter estimates is high, and the estimate is sensitive to random variations in the data. This becomes worse with small amounts of data. To deal with this, it usually helps to add regularisation to MLE, reducing variance by introducing bias into the estimate. In MAP estimation, this regularisation is achieved by assuming that the parameters themselves are also drawn from a random process, like the data. The prior beliefs about the parameters determine what this random process looks like.
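
To make the difference concrete for the class prior P(Y), the sketch below compares the MLE (relative class frequencies) with a smoothed estimate that adds alpha_d pseudo-counts to each class. Treating alpha_d as pseudo-counts is an assumption about how the assignment intends alpha to be used; this form equals the MAP estimate under a Dirichlet prior with parameters alpha_d + 1. The labels and the helper name map_prior are hypothetical.

```r
# Tiny hypothetical label sample (three classes only, for brevity).
y <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 2), levels = 0:2)
counts <- table(y)            # n_d: number of examples in class d
N <- length(y)
K <- nlevels(y)

# MLE of P(Y = d): relative class frequency n_d / N.
prior_mle <- as.numeric(counts) / N

# Smoothed (MAP-style) estimate: (n_d + alpha_d) / (N + sum(alpha)).
# Assumption: alpha_d acts as a pseudo-count for class d.
map_prior <- function(counts, alpha) {
  (as.numeric(counts) + alpha) / (sum(counts) + sum(alpha))
}

alpha <- rep(0.16 * N, K)     # pseudo-counts equal to 16% of the examples, per class
prior_map <- map_prior(counts, alpha)

print(prior_mle)
print(round(prior_map, 4))
```

With alpha = 0 the two estimates coincide; larger alpha pulls the estimate toward the uniform distribution, which is the bias-for-variance trade described above.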

To help interpret the R output that follows, here is a list of rates that are often computed from a confusion matrix for a binary classifier:
True positives (TP): These are cases in which we predicted yes, and it is yes.
True negatives (TN): We predicted no, and it is no.
False positives (FP): We predicted yes, but it is no. (Also known as a “Type I error.”)
False negatives (FN): We predicted no, but it is yes. (Also known as a “Type II error.”)
Accuracy: Overall, how often is the classifier correct? (TP+TN)/total.
Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total. Equivalent to 1 minus Accuracy, and also known as “Error Rate.”
True Positive Rate: When it’s actually yes, how often does it predict yes? TP/actual yes. Also known as “Sensitivity” or “Recall.”
False Positive Rate: When it’s actually no, how often does it predict yes? FP/actual no.
Specificity: When it’s actually no, how often does it predict no? TN/actual no. Equivalent to 1 minus False Positive Rate.
Precision: When it predicts yes, how often is it correct? TP/predicted yes.
Prevalence: How often does the yes condition actually occur in our sample? actual yes/total.
Positive Predictive Value: This is very similar to precision, except that it takes prevalence into account. In the case where the classes are perfectly balanced (meaning the prevalence is 50%), the positive predictive value (PPV) is equivalent to precision.
Null Error Rate: This is how often you would be wrong if you always predicted the majority class. This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox.
Cohen’s Kappa: This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate.
F Score: This is the harmonic mean (a kind of weighted average) of the true positive rate and precision.
ROC Curve: This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class.
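
The definitions above can be checked by hand on a small example. The sketch below uses hypothetical counts (not taken from the MNIST runs) and base R only; the variable names are illustrative.

```r
# Hypothetical binary confusion-matrix counts (illustrative only).
TP <- 100; TN <- 50; FP <- 10; FN <- 5
total <- TP + TN + FP + FN

accuracy    <- (TP + TN) / total
error_rate  <- (FP + FN) / total        # 1 - accuracy
sensitivity <- TP / (TP + FN)           # true positive rate, recall
fpr         <- FP / (FP + TN)           # false positive rate
specificity <- TN / (TN + FP)           # 1 - fpr
precision   <- TP / (TP + FP)
prevalence  <- (TP + FN) / total

# F1: harmonic mean of precision and recall.
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# Balanced accuracy: mean of sensitivity and specificity.
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_chance <- ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / total^2
kappa    <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, error_rate = error_rate,
              sensitivity = sensitivity, specificity = specificity,
              precision = precision, f1 = f1,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 4))
```

For the ten-class MNIST problem, each digit is scored this way in a one-versus-rest fashion, which is why the output reports one row of rates per class.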

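In relation to Bayesian statistics, the sensitivity and specificity are the conditional probabilities, the prevalence is the prior, and the positive/negative predictive values are the posterior probabilities. In the output below, the Prevalence column is the relative frequency of each class, which, with a stratified split, approximates the maximum likelihood estimate of the prior P(Y) computed from the training labels. Balanced Accuracy is the average of sensitivity and specificity; it is a performance measure, not a MAP estimate.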
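
This relationship can be verified numerically. The sketch below recomputes the positive and negative predictive values from sensitivity, specificity, and prevalence with Bayes' theorem, using the values reported for class 0 in the alpha = 0.01 run of the first protocol. The function names are hypothetical helpers, not part of caret.

```r
# Posterior P(actual yes | predicted yes) from sensitivity, specificity, prevalence.
ppv_from_bayes <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

# Posterior P(actual no | predicted no).
npv_from_bayes <- function(sens, spec, prev) {
  (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
}

# Class 0, first protocol, alpha = 0.01 (values copied from the report's output).
sens <- 0.9007092
spec <- 0.9799179
prev <- 0.0987

ppv0 <- ppv_from_bayes(sens, spec, prev)
npv0 <- npv_from_bayes(sens, spec, prev)
print(c(ppv = ppv0, npv = npv0))
```

The results agree with the Pos Pred Value (0.8308411) and Neg Pred Value (0.9890258) columns for class 0 up to rounding of the inputs.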

1st Test Protocol Training set 50,000, Test set 10,000, Alpha = .01, .02, .04, .08, .16.

#Loading necessary Libraries
library(e1071)
library(caret)
library(doParallel)
# Setting up parallel processing.
cl <-makeCluster(detectCores())
registerDoParallel(cl)
#Functions to Load files and labels
loadingmnist <-function()
  {
  loadingimage <-function(filename)
    {
    ret =list()
    f =file(filename,'rb')
    readBin(f,'integer',n=1,size=4,endian='big')
    ret$n =readBin(f,'integer',n=1,size=4,endian='big')
    nrow =readBin(f,'integer',n=1,size=4,endian='big')
    ncol =readBin(f,'integer',n=1,size=4,endian='big')
    x =readBin(f,'integer',n=ret$n*nrow*ncol,size=1,
    signed=F)
    ret$x =matrix(x, ncol=nrow*ncol, byrow=T)
    close(f)
    ret
    }
  loadinglabel <-function(filename)
    {
    f = file(filename,'rb')
    readBin(f,'integer',n=1,size=4,endian='big')
    n = readBin(f,'integer',n=1,size=4,endian='big')
    y = readBin(f,'integer',n=n,size=1,signed=F)
    close(f)
    y
    }
  #Loading files and labels
  train <<-loadingimage('train-images.idx3-ubyte')
  train$y <<-loadinglabel('train-labels.idx1-ubyte')
}
#Function to display digit
displaydigit <- function(arr784, col=gray(12:1/12), ...)
  {
  image(matrix(arr784, nrow=28)[,28:1], col=col, ...)
  }
#Initializing train (it is replaced by the loaded images and labels below)
train <-data.frame()
#Calling the functions to load the database
loadingmnist()
#Normalizing: X=(X - min)/(max - min)=>X=(X-0)/(255-0)=> X=X/255.
train$x <-train$x/255
#Setting seed to produce same results
set.seed(123)
#Dividing data with training having 50k and testing having 10k.
inTrain = data.frame(y=train$y, train$x)
inTrain$y <- as.factor(train$y)
trainIndex = createDataPartition(inTrain$y, p=0.833245, list=FALSE)
training = inTrain[trainIndex,]
testing = inTrain[-trainIndex,]
rm(train)
proc1=double()
proc2=double()
proc3=double()
proc4=double()

#Creating model using Naive Bayes Classifier
i=.01
while (i <=.16)
{
  cat("Processing Naive Bayes Classifier tuning alpha = ", i, "\n", sep = "")
  nb_model<- naiveBayes(y~.,data = training,laplace = i)
  results<-predict(nb_model,newdata=testing,type=c("class"))
  output  <-confusionMatrix(results,testing$y)
  print("Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)")
  print(output$byClass)
  print("Overall Error Rate")
  acc=output$overall['Accuracy']
  j=i*100
  proc1[j]=c(1-acc)
  print(proc1[j])
  cat("\n")
  i=i*2
}

## Processing Naive Bayes Classifier tuning alpha = 0.01
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631
## 
## Processing Naive Bayes Classifier tuning alpha = 0.02
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631
## 
## Processing Naive Bayes Classifier tuning alpha = 0.04
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631
## 
## Processing Naive Bayes Classifier tuning alpha = 0.08
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631
## 
## Processing Naive Bayes Classifier tuning alpha = 0.16
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631

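All five alpha values give the same overall error rate (0.3631), so the choice among them is a tie; Alpha = .01 is kept as the preferred model for protocol 1.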
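
The error rate is identical for every alpha in the runs above. A likely explanation, as far as the e1071 documentation describes it, is that the laplace argument of naiveBayes smooths the frequency tables of categorical predictors only, while numeric predictors such as pixel intensities are modeled by a per-class Gaussian mean and standard deviation that laplace does not touch. The sketch below is a minimal base-R Gaussian Naive Bayes, not e1071's implementation, in which alpha enters the model as a pseudo-count on the class prior P(Y). The function names fit_gnb and predict_gnb, the standard-deviation floor eps, and the synthetic data are all assumptions made for illustration.

```r
# Fit: per-class prior plus per-class, per-feature Gaussian parameters.
fit_gnb <- function(X, y, alpha = 0, eps = 1e-3) {
  y <- factor(y)
  classes <- levels(y)
  counts <- as.numeric(table(y))
  # Smoothed prior: alpha pseudo-counts per class (alpha = 0 gives the MLE).
  prior <- (counts + alpha) / (sum(counts) + alpha * length(classes))
  mu  <- do.call(rbind, lapply(classes, function(k)
    colMeans(X[y == k, , drop = FALSE])))
  sdv <- do.call(rbind, lapply(classes, function(k)
    apply(X[y == k, , drop = FALSE], 2, sd)))
  # Floor (assumption) so constant features do not produce sd = 0.
  sdv[is.na(sdv) | sdv < eps] <- eps
  list(classes = classes, prior = prior, mu = mu, sd = sdv)
}

# Predict: argmax over classes of log P(Y = k) + sum_j log N(x_j | mu_kj, sd_kj).
predict_gnb <- function(model, X) {
  scores <- do.call(cbind, lapply(seq_along(model$classes), function(k) {
    # t(X) is features x samples, so mean/sd recycle along the features.
    ll <- matrix(dnorm(t(X), mean = model$mu[k, ], sd = model$sd[k, ],
                       log = TRUE), nrow = ncol(X))
    log(model$prior[k]) + colSums(ll)
  }))
  model$classes[max.col(scores, ties.method = "first")]
}

# Synthetic, imbalanced two-class data with two features (illustrative only).
set.seed(1)
y <- rep(c("0", "1"), times = c(150, 50))
X <- rbind(matrix(rnorm(300, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

model_mle <- fit_gnb(X, y, alpha = 0)
model_map <- fit_gnb(X, y, alpha = 0.16 * length(y))
print(rbind(mle = model_mle$prior, map = model_map$prior))

pred <- predict_gnb(model_map, X)
print(mean(pred != y))   # training error rate on the synthetic data
```

In this form, changing alpha changes the estimated prior, and therefore can change predictions near class boundaries, so different alpha values would no longer be guaranteed to give the same error rate.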

2nd Test Protocol Training set 40,000, Validation set 10,000, Test set 10,000, Alpha = .01, .02, .04, .08, .16.

#Setting seed to produce same results
set.seed(123)
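#Dividing data with training having 40k, validation set having 10k, and testing having 10k.
#Note: testing is not reassigned here, so it still holds the protocol-1 test rows, which can overlap this training set;
#inbet[-valIndex,] holds the rows left out of both training and validation.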
trainIndex = createDataPartition(inTrain$y, p=0.666565, list=FALSE)
training = inTrain[trainIndex,]
inbet =  inTrain[-trainIndex,]
valIndex = createDataPartition(inbet$y, p=0.499745, list=FALSE)
validation = inbet[valIndex,]
#Creating model using Naive Bayes Classifier
i=.01
while (i <=.16)
{
  cat("Processing Naive Bayes Classifier tuning alpha = ", i, "\n", sep = "")
  nb_model<- naiveBayes(y~.,data = training,laplace = i)
  results<-predict(nb_model,newdata=validation,
  type=c("class"))
  output  <-confusionMatrix(results,validation$y)
  print("Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)")
  print(output$byClass)
  print("Overall Error Rate")
  acc=output$overall['Accuracy']
  j=i*100
  proc2[j]=c(1-acc)
  print(proc2[j])
  cat("\n")
  i=i*2
}

## Processing Naive Bayes Classifier tuning alpha = 0.01
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8986829   0.9792522      0.8258845      0.9887968 0.8258845
## Class: 1   0.9466192   0.9596665      0.7482419      0.9930054 0.7482419
## Class: 2   0.4179255   0.9936716      0.8792373      0.9393367 0.8792373
## Class: 3   0.5655577   0.9832925      0.7939560      0.9521139 0.7939560
## Class: 4   0.2659138   0.9942389      0.8327974      0.9262050 0.8327974
## Class: 5   0.1316372   0.9960422      0.7677419      0.9202641 0.7677419
## Class: 6   0.9249493   0.9447526      0.6468085      0.9913853 0.6468085
## Class: 7   0.4521073   0.9948638      0.9111969      0.9396752 0.9111969
## Class: 8   0.6964103   0.8780055      0.3814607      0.9639903 0.3814607
## Class: 9   0.9051463   0.8631369      0.4211268      0.9880559 0.4211268
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8986829 0.8607472     0.0987         0.0887
## Class: 1 0.9466192 0.8358209     0.1124         0.1064
## Class: 2 0.4179255 0.5665529     0.0993         0.0415
## Class: 3 0.5655577 0.6605714     0.1022         0.0578
## Class: 4 0.2659138 0.4031128     0.0974         0.0259
## Class: 5 0.1316372 0.2247403     0.0904         0.0119
## Class: 6 0.9249493 0.7612688     0.0986         0.0912
## Class: 7 0.4521073 0.6043534     0.1044         0.0472
## Class: 8 0.6964103 0.4929220     0.0975         0.0679
## Class: 9 0.9051463 0.5748158     0.0991         0.0897
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1074         0.9389675
## Class: 1               0.1422         0.9531429
## Class: 2               0.0472         0.7057985
## Class: 3               0.0728         0.7744251
## Class: 4               0.0311         0.6300763
## Class: 5               0.0155         0.5638397
## Class: 6               0.1410         0.9348509
## Class: 7               0.0518         0.7234855
## Class: 8               0.1780         0.7872079
## Class: 9               0.2130         0.8841416
## [1] "Overall Error Rate"
## [1] 0.3718
## 
## Processing Naive Bayes Classifier tuning alpha = 0.02
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8986829   0.9792522      0.8258845      0.9887968 0.8258845
## Class: 1   0.9466192   0.9596665      0.7482419      0.9930054 0.7482419
## Class: 2   0.4179255   0.9936716      0.8792373      0.9393367 0.8792373
## Class: 3   0.5655577   0.9832925      0.7939560      0.9521139 0.7939560
## Class: 4   0.2659138   0.9942389      0.8327974      0.9262050 0.8327974
## Class: 5   0.1316372   0.9960422      0.7677419      0.9202641 0.7677419
## Class: 6   0.9249493   0.9447526      0.6468085      0.9913853 0.6468085
## Class: 7   0.4521073   0.9948638      0.9111969      0.9396752 0.9111969
## Class: 8   0.6964103   0.8780055      0.3814607      0.9639903 0.3814607
## Class: 9   0.9051463   0.8631369      0.4211268      0.9880559 0.4211268
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8986829 0.8607472     0.0987         0.0887
## Class: 1 0.9466192 0.8358209     0.1124         0.1064
## Class: 2 0.4179255 0.5665529     0.0993         0.0415
## Class: 3 0.5655577 0.6605714     0.1022         0.0578
## Class: 4 0.2659138 0.4031128     0.0974         0.0259
## Class: 5 0.1316372 0.2247403     0.0904         0.0119
## Class: 6 0.9249493 0.7612688     0.0986         0.0912
## Class: 7 0.4521073 0.6043534     0.1044         0.0472
## Class: 8 0.6964103 0.4929220     0.0975         0.0679
## Class: 9 0.9051463 0.5748158     0.0991         0.0897
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1074         0.9389675
## Class: 1               0.1422         0.9531429
## Class: 2               0.0472         0.7057985
## Class: 3               0.0728         0.7744251
## Class: 4               0.0311         0.6300763
## Class: 5               0.0155         0.5638397
## Class: 6               0.1410         0.9348509
## Class: 7               0.0518         0.7234855
## Class: 8               0.1780         0.7872079
## Class: 9               0.2130         0.8841416
## [1] "Overall Error Rate"
## [1] 0.3718
## 
## Processing Naive Bayes Classifier tuning alpha = 0.04
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8986829   0.9792522      0.8258845      0.9887968 0.8258845
## Class: 1   0.9466192   0.9596665      0.7482419      0.9930054 0.7482419
## Class: 2   0.4179255   0.9936716      0.8792373      0.9393367 0.8792373
## Class: 3   0.5655577   0.9832925      0.7939560      0.9521139 0.7939560
## Class: 4   0.2659138   0.9942389      0.8327974      0.9262050 0.8327974
## Class: 5   0.1316372   0.9960422      0.7677419      0.9202641 0.7677419
## Class: 6   0.9249493   0.9447526      0.6468085      0.9913853 0.6468085
## Class: 7   0.4521073   0.9948638      0.9111969      0.9396752 0.9111969
## Class: 8   0.6964103   0.8780055      0.3814607      0.9639903 0.3814607
## Class: 9   0.9051463   0.8631369      0.4211268      0.9880559 0.4211268
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8986829 0.8607472     0.0987         0.0887
## Class: 1 0.9466192 0.8358209     0.1124         0.1064
## Class: 2 0.4179255 0.5665529     0.0993         0.0415
## Class: 3 0.5655577 0.6605714     0.1022         0.0578
## Class: 4 0.2659138 0.4031128     0.0974         0.0259
## Class: 5 0.1316372 0.2247403     0.0904         0.0119
## Class: 6 0.9249493 0.7612688     0.0986         0.0912
## Class: 7 0.4521073 0.6043534     0.1044         0.0472
## Class: 8 0.6964103 0.4929220     0.0975         0.0679
## Class: 9 0.9051463 0.5748158     0.0991         0.0897
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1074         0.9389675
## Class: 1               0.1422         0.9531429
## Class: 2               0.0472         0.7057985
## Class: 3               0.0728         0.7744251
## Class: 4               0.0311         0.6300763
## Class: 5               0.0155         0.5638397
## Class: 6               0.1410         0.9348509
## Class: 7               0.0518         0.7234855
## Class: 8               0.1780         0.7872079
## Class: 9               0.2130         0.8841416
## [1] "Overall Error Rate"
## [1] 0.3718
## 
## Processing Naive Bayes Classifier tuning alpha = 0.08
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8986829   0.9792522      0.8258845      0.9887968 0.8258845
## Class: 1   0.9466192   0.9596665      0.7482419      0.9930054 0.7482419
## Class: 2   0.4179255   0.9936716      0.8792373      0.9393367 0.8792373
## Class: 3   0.5655577   0.9832925      0.7939560      0.9521139 0.7939560
## Class: 4   0.2659138   0.9942389      0.8327974      0.9262050 0.8327974
## Class: 5   0.1316372   0.9960422      0.7677419      0.9202641 0.7677419
## Class: 6   0.9249493   0.9447526      0.6468085      0.9913853 0.6468085
## Class: 7   0.4521073   0.9948638      0.9111969      0.9396752 0.9111969
## Class: 8   0.6964103   0.8780055      0.3814607      0.9639903 0.3814607
## Class: 9   0.9051463   0.8631369      0.4211268      0.9880559 0.4211268
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8986829 0.8607472     0.0987         0.0887
## Class: 1 0.9466192 0.8358209     0.1124         0.1064
## Class: 2 0.4179255 0.5665529     0.0993         0.0415
## Class: 3 0.5655577 0.6605714     0.1022         0.0578
## Class: 4 0.2659138 0.4031128     0.0974         0.0259
## Class: 5 0.1316372 0.2247403     0.0904         0.0119
## Class: 6 0.9249493 0.7612688     0.0986         0.0912
## Class: 7 0.4521073 0.6043534     0.1044         0.0472
## Class: 8 0.6964103 0.4929220     0.0975         0.0679
## Class: 9 0.9051463 0.5748158     0.0991         0.0897
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1074         0.9389675
## Class: 1               0.1422         0.9531429
## Class: 2               0.0472         0.7057985
## Class: 3               0.0728         0.7744251
## Class: 4               0.0311         0.6300763
## Class: 5               0.0155         0.5638397
## Class: 6               0.1410         0.9348509
## Class: 7               0.0518         0.7234855
## Class: 8               0.1780         0.7872079
## Class: 9               0.2130         0.8841416
## [1] "Overall Error Rate"
## [1] 0.3718
## 
## Processing Naive Bayes Classifier tuning alpha = 0.16
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8986829   0.9792522      0.8258845      0.9887968 0.8258845
## Class: 1   0.9466192   0.9596665      0.7482419      0.9930054 0.7482419
## Class: 2   0.4179255   0.9936716      0.8792373      0.9393367 0.8792373
## Class: 3   0.5655577   0.9832925      0.7939560      0.9521139 0.7939560
## Class: 4   0.2659138   0.9942389      0.8327974      0.9262050 0.8327974
## Class: 5   0.1316372   0.9960422      0.7677419      0.9202641 0.7677419
## Class: 6   0.9249493   0.9447526      0.6468085      0.9913853 0.6468085
## Class: 7   0.4521073   0.9948638      0.9111969      0.9396752 0.9111969
## Class: 8   0.6964103   0.8780055      0.3814607      0.9639903 0.3814607
## Class: 9   0.9051463   0.8631369      0.4211268      0.9880559 0.4211268
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8986829 0.8607472     0.0987         0.0887
## Class: 1 0.9466192 0.8358209     0.1124         0.1064
## Class: 2 0.4179255 0.5665529     0.0993         0.0415
## Class: 3 0.5655577 0.6605714     0.1022         0.0578
## Class: 4 0.2659138 0.4031128     0.0974         0.0259
## Class: 5 0.1316372 0.2247403     0.0904         0.0119
## Class: 6 0.9249493 0.7612688     0.0986         0.0912
## Class: 7 0.4521073 0.6043534     0.1044         0.0472
## Class: 8 0.6964103 0.4929220     0.0975         0.0679
## Class: 9 0.9051463 0.5748158     0.0991         0.0897
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1074         0.9389675
## Class: 1               0.1422         0.9531429
## Class: 2               0.0472         0.7057985
## Class: 3               0.0728         0.7744251
## Class: 4               0.0311         0.6300763
## Class: 5               0.0155         0.5638397
## Class: 6               0.1410         0.9348509
## Class: 7               0.0518         0.7234855
## Class: 8               0.1780         0.7872079
## Class: 9               0.2130         0.8841416
## [1] "Overall Error Rate"
## [1] 0.3718

Applying the preferred model selected on the validation set (alpha = 0.01) to the test set.

  #Creating model using Naive Bayes Classifier
  cat("Processing Naive Bayes Classifier tuning alpha = \n")

## Processing Naive Bayes Classifier tuning alpha =

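  #laplace is the smoothing value (alpha); in e1071 it applies to categorical predictors only
  nb_model<- naiveBayes(y~.,data = training,laplace = .01)
  #Predicting class labels for the test set and tabulating them against the true labels
  results<-predict(nb_model,newdata=testing,type=c("class"))
  output  <-confusionMatrix(results,testing$y)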
  print("Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)")

## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"

  print(output$byClass)

##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.8946302   0.9814712      0.8409524      0.9883799 0.8409524
## Class: 1   0.9599644   0.9554980      0.7320217      0.9947220 0.7320217
## Class: 2   0.4582075   0.9936716      0.8886719      0.9432968 0.8886719
## Class: 3   0.5831703   0.9837380      0.8032345      0.9539857 0.8032345
## Class: 4   0.2854209   0.9932417      0.8200590      0.9279578 0.8200590
## Class: 5   0.1273533   0.9956029      0.7419355      0.9199594 0.7419355
## Class: 6   0.9330629   0.9504105      0.6730066      0.9923549 0.6730066
## Class: 7   0.4636015   0.9948638      0.9132075      0.9408659 0.9132075
## Class: 8   0.6133333   0.8812188      0.3580838      0.9547419 0.3580838
## Class: 9   0.9183468   0.8612345      0.4215641      0.9896670 0.4215641
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.8946302 0.8669612     0.0987         0.0883
## Class: 1 0.9599644 0.8306390     0.1124         0.1079
## Class: 2 0.4582075 0.6046512     0.0993         0.0455
## Class: 3 0.5831703 0.6757370     0.1022         0.0596
## Class: 4 0.2854209 0.4234577     0.0974         0.0278
## Class: 5 0.1273533 0.2173913     0.0903         0.0115
## Class: 6 0.9330629 0.7819805     0.0986         0.0920
## Class: 7 0.4636015 0.6149936     0.1044         0.0484
## Class: 8 0.6133333 0.4521739     0.0975         0.0598
## Class: 9 0.9183468 0.5778624     0.0992         0.0911
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1050         0.9380507
## Class: 1               0.1474         0.9577312
## Class: 2               0.0512         0.7259395
## Class: 3               0.0742         0.7834541
## Class: 4               0.0339         0.6393313
## Class: 5               0.0155         0.5614781
## Class: 6               0.1367         0.9417367
## Class: 7               0.0530         0.7292327
## Class: 8               0.1670         0.7472761
## Class: 9               0.2161         0.8897906

  print("Overall Error Rate")

## [1] "Overall Error Rate"

  acc=output$overall['Accuracy']
  print(1-acc)

## Accuracy 
##   0.3681

  cat("\n")
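
The per-class columns printed above can be reproduced by hand from a confusion matrix. The following is a minimal base R sketch on a made-up 3-class matrix (the counts are illustrative, not MNIST results), assuming rows hold predictions and columns hold the true labels, which is the orientation caret's confusionMatrix reports. As a check against the output above, the balanced accuracy of class 0 is (0.8946302 + 0.9814712) / 2 = 0.9380507.

```r
# Toy 3-class confusion matrix: rows = predicted class, columns = actual class.
# The counts are invented for illustration only.
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 2, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

n  <- sum(cm)
tp <- diag(cm)            # class d predicted as d
fn <- colSums(cm) - tp    # class d predicted as something else
fp <- rowSums(cm) - tp    # other classes predicted as d
tn <- n - tp - fn - fp    # everything else

sensitivity       <- tp / (tp + fn)   # same quantity as recall
specificity       <- tn / (tn + fp)
precision         <- tp / (tp + fp)   # same quantity as positive predictive value
balanced_accuracy <- (sensitivity + specificity) / 2
error_rate        <- 1 - sum(tp) / n  # overall misclassification rate

print(round(cbind(sensitivity, specificity, precision, balanced_accuracy), 4))
print(error_rate)
```

Sensitivity and recall, and precision and positive predictive value, are the same quantities, which is why those column pairs are identical in the report.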

3rd Protocol: 5-fold and 10-fold cross-validation (average and standard deviation)

#Setting seed to produce the same results on every run
set.seed(123)
trainIndex = createDataPartition(inTrain$y, p=0.833245, list=FALSE)
training = inTrain[trainIndex,]
for (j in seq(5, 10, 5))
{
  #Requesting j folds (5, then 10)
  ctrl <- trainControl(method = "cv", number = j)
  #NOTE: e1071's naiveBayes() does no resampling, so ctrl has no effect on the fit;
  #both iterations train the same model on the same training set
  nb_model<- naiveBayes(y~.,data = training,ctrl)
  cat("Processing Naive Bayes using Cross Validation ", j, " Folds\n", sep = "")
  #Scoring the fitted model on the test set
  results<-predict(nb_model,newdata=testing,type=c("class"))
  output  <-confusionMatrix(results,testing$y)
  print("Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)")
  print(output$byClass)
  print("Overall Error Rate")
  acc=output$overall['Accuracy']
  #Storing the error rate at index j (5 or 10); the other slots stay NA
  proc3[j]=c(1-acc)
  print(proc3[j])
  cat("\n")
}

## Processing Naive Bayes using Cross Validation 5 Folds
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631
## 
## Processing Naive Bayes using Cross Validation 10 Folds
## [1] "Prior(Y) Maximum Likelihood Estimate (Prevalence Detection & Maximum a Priori (Balanced Accuracy)"
##          Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
## Class: 0   0.9007092   0.9799179      0.8308411      0.9890258 0.8308411
## Class: 1   0.9581851   0.9549347      0.7291808      0.9944855 0.7291808
## Class: 2   0.4864048   0.9940047      0.8994413      0.9461059 0.8994413
## Class: 3   0.5812133   0.9840722      0.8059701      0.9537947 0.8059701
## Class: 4   0.3223819   0.9929094      0.8306878      0.9314072 0.8306878
## Class: 5   0.1683278   0.9945037      0.7524752      0.9233517 0.7524752
## Class: 6   0.9148073   0.9601731      0.7153053      0.9903879 0.7153053
## Class: 7   0.4243295   0.9956454      0.9190871      0.9368565 0.9190871
## Class: 8   0.6184615   0.8821053      0.3617277      0.9553582 0.3617277
## Class: 9   0.9193548   0.8582371      0.4166286      0.9897580 0.4166286
##             Recall        F1 Prevalence Detection Rate
## Class: 0 0.9007092 0.8643656     0.0987         0.0889
## Class: 1 0.9581851 0.8281430     0.1124         0.1077
## Class: 2 0.4864048 0.6313725     0.0993         0.0483
## Class: 3 0.5812133 0.6753837     0.1022         0.0594
## Class: 4 0.3223819 0.4644970     0.0974         0.0314
## Class: 5 0.1683278 0.2751131     0.0903         0.0152
## Class: 6 0.9148073 0.8028482     0.0986         0.0902
## Class: 7 0.4243295 0.5806029     0.1044         0.0443
## Class: 8 0.6184615 0.4564724     0.0975         0.0603
## Class: 9 0.9193548 0.5734046     0.0992         0.0912
##          Detection Prevalence Balanced Accuracy
## Class: 0               0.1070         0.9403136
## Class: 1               0.1477         0.9565599
## Class: 2               0.0537         0.7402047
## Class: 3               0.0737         0.7826427
## Class: 4               0.0378         0.6576457
## Class: 5               0.0202         0.5814157
## Class: 6               0.1261         0.9374902
## Class: 7               0.0482         0.7099874
## Class: 8               0.1667         0.7502834
## Class: 9               0.2189         0.8887960
## [1] "Overall Error Rate"
## [1] 0.3631

The 5-fold and 10-fold cross-validation runs report the same error rate (0.3631), which is close to the error from protocol 1 and about 0.005 lower. The two runs also agree on every per-class metric, because the trainControl object has no effect on naiveBayes, which does no resampling: both iterations fit the same model on the same training set and score it on the same test set. These numbers therefore do not measure fold-to-fold variability.
We keep the 10-fold setting as the final model on the expectation of less variability, although with identical errors these results do not by themselves favor either setting.
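
For comparison, a k-fold run that actually resamples could look like the sketch below. It is an illustration under stated assumptions, not the code used for the results above: it runs on the built-in iris data rather than MNIST, the fold assignment uses base R's sample, and fit_gnb, predict_gnb, and cv_error are hypothetical helpers that stand in for naiveBayes with a small Gaussian naive Bayes. Each fold is held out once, and the mean and standard deviation of the k error rates are reported.

```r
# Minimal Gaussian naive Bayes in base R (hypothetical stand-in for e1071::naiveBayes).
fit_gnb <- function(x, y) {
  x <- as.matrix(x)
  classes <- levels(y)
  list(
    classes = classes,
    prior   = as.numeric(table(y)[classes]) / length(y),  # MLE of P(Y)
    mu      = t(sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE]))),
    sigma   = t(sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd)))
  )
}

predict_gnb <- function(model, x) {
  x <- as.matrix(x)
  # Per class: log prior plus the sum of per-feature Gaussian log densities.
  scores <- sapply(seq_along(model$classes), function(k) {
    log_dens <- dnorm(t(x), mean = model$mu[k, ], sd = model$sigma[k, ], log = TRUE)
    log(model$prior[k]) + colSums(log_dens)
  })
  scores <- matrix(scores, nrow = nrow(x))  # keep a matrix even for a single row
  factor(model$classes[max.col(scores, ties.method = "first")],
         levels = model$classes)
}

# Error rate on each of the k held-out folds.
cv_error <- function(x, y, k, seed = 123) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(x)))  # random fold label per row
  sapply(seq_len(k), function(f) {
    held_out <- fold == f
    model <- fit_gnb(x[!held_out, , drop = FALSE], y[!held_out])
    mean(predict_gnb(model, x[held_out, , drop = FALSE]) != y[held_out])
  })
}

for (k in c(5, 10)) {
  errs <- cv_error(iris[, 1:4], iris$Species, k)
  cat(k, "-fold CV: mean error = ", round(mean(errs), 4),
      ", sd = ", round(sd(errs), 4), "\n", sep = "")
}
```

On MNIST this sketch would also need a guard for pixels whose within-class standard deviation is zero, which iris does not exercise.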

Below is a graph of the alphas and their error rates for protocol 1 and protocol 2, together with the protocol 3 error rates, which are plotted at their fold counts (5 and 10) on the same axis.

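#x-coordinates: alpha values (% of training examples) and fold counts
i=c(1,2,4,8,16)
j=c(5,10)
#Dropping the NA entries of each error-rate vector
proc1=na.omit(proc1)
proc2=na.omit(proc2)
proc3=na.omit(proc3)
#Protocol 1 as points and lines, then protocols 2 and 3 added as lines
plot(i,proc1,type="b",col="red",main = "Error Rates as Alpha increases",xlab = "Alpha %",ylab = "Error Rate",ylim = c(0,.40), xlim = c(0,16))
lines(i,proc2,col="green")
lines(j,proc3,col="black")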

Protocol 1 is RED, Protocol 2 is GREEN, and Protocol 3 is BLACK.

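We verify the final model below by predicting the digit in one image. The image is taken from the training set, so this is a sanity check rather than a test of generalization.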

#Best model selected (ctrl has no effect on naiveBayes, which does no resampling)
  ctrl <- trainControl(method = "cv", number = 10)
  nb_model<- naiveBayes(y~.,data = training,ctrl)
#Drawing a digit.
displaydigit(as.matrix(training[5,2:785]))

#Predicting the digit.
print("The predicted digit is:")

## [1] "The predicted digit is:"

predict(nb_model, newdata = training[5,])

## [1] 9
## Levels: 0 1 2 3 4 5 6 7 8 9

#Verifying the answer for the digit.
print("The actual digit is:")

## [1] "The actual digit is:"

training[5,1]

## [1] 9
## Levels: 0 1 2 3 4 5 6 7 8 9

Between KNN and Naive Bayes, KNN produces better results. I did not pre-process the data, and because Naive Bayes is much more efficient, I used the entire dataset rather than drawing a random sample for the sets as I did in the previous assignment.
Going into specifics, KNN is a supervised, lazy classifier that relies on local heuristics. Because it defers all computation to prediction time, it is difficult to use for real-time prediction. The decision boundaries obtained with KNN can be much more complex than those of a decision tree, which often yields a good classification. When a problem comes down to finding similarity between observations, as with the MNIST dataset, KNN does well because it optimizes locally. The flip side is that outliers can significantly hurt performance. Additionally, KNN is prone to overfitting, so k must be tuned to maximize performance on held-out data. As the dimensionality of the space grows, the accuracy of KNN drops and more data is needed, but classifying n points against n training points takes on the order of n^2 distance computations, which becomes too slow. A dimensionality reduction technique such as PCA or SVD is therefore typically applied before the classifier is used.
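
The cost and the role of k described above can be made concrete with a brute-force version of the classifier. This is a minimal base R sketch on the built-in iris data, not the KNN code from the previous assignment; knn_predict is a hypothetical helper written for this illustration. Every query computes one distance per training row, which is where the n^2 growth comes from when the test set grows with the training set.

```r
# Brute-force k-nearest-neighbour classifier (hypothetical helper, base R only).
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  pred <- apply(test_x, 1, function(q) {
    # One query costs nrow(train_x) distance computations.
    d2 <- rowSums(sweep(train_x, 2, q)^2)   # squared Euclidean distances
    nearest <- order(d2)[seq_len(k)]        # indices of the k closest points
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]          # majority vote (ties go to the first level)
  })
  factor(pred, levels = levels(train_y))
}

set.seed(123)
idx <- sample(nrow(iris), 100)              # 100 training rows, 50 held-out rows
for (k in c(1, 5, 15)) {
  pred <- knn_predict(iris[idx, 1:4], iris$Species[idx], iris[-idx, 1:4], k = k)
  cat("k =", k, " held-out error =", round(mean(pred != iris$Species[-idx]), 4), "\n")
}
```

Comparing the held-out error across values of k is the tuning step the paragraph refers to.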

Naive Bayes is an eager learning classifier and is much faster than KNN, so it can be used for prediction in real time. Email spam filtering is a typical application. It takes a probabilistic route and produces a probability estimate for each class. It assumes conditional independence between the features given the class and selects the class under a maximum likelihood (or, with a prior, maximum a posteriori) hypothesis. Another advantage is that the classifier can keep learning over time. In spam filtering, the words used in spam evolve; the classifier can update its probability estimates for newly occurring spam words, in a sort of bag-of-words model, and continue to perform well. This is possible because the algorithm is generative rather than discriminative.

KNN can be used when the dataset is small enough for it to finish in a reasonable time. In general, the complexity of brute-force KNN is O(n^2) when n points are classified against n training points, so plain KNN without any tricks usually becomes impractical as the data grows. KNN based on a KD-tree may help a bit, but the best it can do is about O(log n) per query, and only in low dimensions. Naive Bayes learning is just counting and calculating probabilities, and the cost of applying a trained model does not depend on the number of training points, that is, O(1) in n, although it grows with the number of classes and features. KNN might work best when the data is small, but when the data size is expected to keep increasing it is wise to choose Naive Bayes.

KNN may be a good choice in some settings, especially if the data is low dimensional, but it is rarely appropriate for practical image classification. Images are high-dimensional objects that contain many pixels, and distances in high-dimensional spaces can be very counterintuitive. The same image could be shifted, darkened, or otherwise altered, and the pixel-wise distance would not correspond at all to perceptual or semantic similarity.
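
The point about shifted images can be checked on a tiny example. The sketch below uses three made-up 5x5 binary images (not MNIST data) and a hypothetical helper, bar_at, to compare a vertical bar with a copy shifted one pixel to the right and with a horizontal bar. In this case the shifted copy ends up farther away in pixel space than the different shape.

```r
# Three tiny 5x5 binary "images", invented for illustration only.
bar_at <- function(col) {
  img <- matrix(0, 5, 5)
  img[, col] <- 1
  img
}
img_a       <- bar_at(2)          # vertical bar in column 2
img_shifted <- bar_at(3)          # the same bar moved one pixel to the right
img_other   <- matrix(0, 5, 5)
img_other[3, ] <- 1               # a horizontal bar, i.e. a different shape

# Pixel-wise Euclidean distance between two images of equal size.
euclid <- function(p, q) sqrt(sum((p - q)^2))

print(euclid(img_a, img_shifted)) # shifted copy of the same shape
print(euclid(img_a, img_other))   # different shape
```
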
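Unlike KNN, Naive Bayes did not appear to overfit on the MNIST dataset, possibly because the classes are fairly evenly distributed. When the validation set was used to select the model, the error rate increased, but it led to picking the same alpha. The alphas also did not seem to affect the error rate or reduce overfitting. One likely reason is that the laplace argument of e1071's naiveBayes smooths only categorical predictors, whereas the pixel features here, assuming the columns are numeric, are modeled as Gaussians and are not touched by it.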
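
To see how an alpha could enter the prior P(Y) directly, the sketch below compares the MLE of the class prior with a smoothed, MAP-style estimate. It rests on assumptions that are not taken from the code above: alpha is treated as a pseudo-count added to every class, the labels are synthetic and only roughly balanced like the ten digit classes, and class_prior is a hypothetical helper. With the same alpha for every class, smoothing pulls the prior toward the uniform distribution, so on nearly balanced labels it changes very little.

```r
# MLE vs. smoothed (MAP-style) estimate of the class prior P(Y = d).
# alpha is a hypothetical pseudo-count, identical for every class.
class_prior <- function(y, alpha = 0) {
  counts <- table(y)
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

# Synthetic, nearly balanced labels for ten classes (not the MNIST labels).
set.seed(123)
y <- factor(sample(0:9, 6000, replace = TRUE), levels = 0:9)

mle <- class_prior(y)                        # alpha = 0 gives the MLE
for (pct in c(0.01, 0.02, 0.04, 0.08, 0.16)) {
  map <- class_prior(y, alpha = pct * length(y))
  cat("alpha = ", pct * 100, "% of n: max |MAP - MLE| = ",
      signif(max(abs(map - mle)), 3), "\n", sep = "")
}
```

A per-class alpha d, as in the assignment statement, would replace the single alpha with a vector of ten pseudo-counts and the denominator term with their sum.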

*www.quora.com