Math 645 Final Project

The problem:

In this problem we take 7 data attributes of bananas, namely size, weight, sweetness, softness, harvest time, ripeness, and acidity, as predictor variables to determine whether a given banana in the banana_quality dataset should be classified as having good or bad quality; hence the name “Banana Split”. Being able to perform such classification provides obvious benefits: those who cultivate bananas would become better informed about which attributes lead to high-quality fruit, those who buy or sell bananas could make better-informed decisions, and those who make products using bananas could source fruit of the highest quality possible.

Our goal is to reach the highest accuracy rate possible in testing, preferably attaining an accuracy rate of at least 95% in classifying bananas as having “good” or “bad” quality.

Motivation:

Classification based on data attributes is commonly needed in multiple real-world scenarios. In project three we compared various classical methods of classification, including Logistic Regression, as well as generative models such as Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes (NB), and K-Nearest Neighbors (KNN), to categorize wine into one of three types based on various predictors given in the wine.data dataset. In that project, I was able to achieve a maximum prediction accuracy of 94.47% using KNN with k=9 nearest neighbors. In later projects we performed classification using modern methods such as Classification Trees, Bagging Classification Trees, and Classification via Support Vector Machines.

I sought a project that would allow me to perform each of these methods, both classical and modern, in seeking an optimum classification. I also preferred a project with two output classes, wherein each method described above could be an appropriate solution approach. The importance of a classification such as this can be found in the description of the problem above.

Examining the Data: banana_quality

This dataset, “banana_quality”, is freely available for download on the data science website Kaggle. It contains 8000 samples of banana characteristics, split approximately 50%-50% between good and bad quality. The dependent variable “Quality” is a factor, while there are seven numeric predictor variables: Size, Weight, Sweetness, Softness, HarvestTime, Ripeness, and Acidity. Below, we show a summary of key information for variables in the banana_quality dataset, as well as the first ten samples of the dataset. Note: the data in this dataset has been previously scaled, so typically positive quantities such as sizes may appear negative. Finally, I add a correlation plot including all variables, treating “Quality” as a binary numeric variable (Good=0, Bad=1). Though this is not a true correlation in the way the correlations among the seven numeric predictors are, the listed “correlation” shows the association of each predictor with the classification output in a manner that will inform variable selection.

Technical note: I created a copied data frame of the original dataset, so that I could treat “Quality” as a numeric variable in numeric contexts, yet still call on “Quality” as a factor when desired in later applications.

banana_quality <- read.csv("/Users/shaun.bardell/Desktop/banana_quality.csv")
banana_quality$Quality <- as.factor(banana_quality$Quality)
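
For completeness, the following is a minimal sketch of the setup implied by the technical note above: the packages assumed for the methods used later in this report, and the numeric copy of the dataset (the name banana_copy and the Good = 0 / Bad = 1 coding follow the description above).

library(MASS)          ## lda(), qda()
library(e1071)         ## naiveBayes(), svm(), tune()
library(class)         ## knn()
library(tree)          ## tree(), cv.tree(), prune.misclass()
library(randomForest)  ## randomForest()
## Copy of the data with Quality as a binary numeric variable (Good = 0, Bad = 1).
banana_copy <- banana_quality
banana_copy$Quality <- ifelse(banana_copy$Quality == "Good", 0, 1)
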
Summary of Predictor Variables:
Size Weight Sweetness Softness HarvestTime Ripeness Acidity
Min. -7.9980736 -8.2830020 -6.4340215 -6.9593196 -7.5700083 -7.4231553 -8.2269770
1st Qu. -2.2776507 -2.2235744 -2.1073294 -1.5904583 -2.1206587 -0.5742256 -1.6294500
Median -0.8975140 -0.8686590 -1.0206731 0.2026440 -0.9341920 0.9649517 0.0987351
Mean -0.7478018 -0.7610194 -0.7702241 -0.0144409 -0.7512883 0.7810984 0.0087251
3rd Qu. 0.6542161 0.7754915 0.3110480 1.5471202 0.5073260 2.2616505 1.6820629
Max. 7.9708004 5.6796920 7.5393740 8.2415550 6.2932800 7.2490335 7.4116335
Banana Quality Frequency:
Bad 3994
Good 4006
First ten rows of the banana_quality dataset:
Size Weight Sweetness Softness HarvestTime Ripeness Acidity Quality
-1.9249682 0.4680781 3.0778325 -1.4721768 0.2947986 2.435570 0.2712903 Good
-2.4097514 0.4868699 0.3469214 -2.4950993 -0.8922133 2.067549 0.3073251 Good
-0.3576066 1.4831762 1.5684522 -2.6451454 -0.6472673 3.090643 1.4273220 Good
-0.8685235 1.5662014 1.8896049 -1.2737614 -1.0062776 1.873001 0.4778617 Good
0.6518252 1.3191992 -0.0224590 -1.2097088 -1.4306920 1.078345 2.8124418 Good
-2.8077223 1.1381357 3.4476268 -1.7133021 -2.2209115 2.079410 2.2812028 Good
-0.2302080 2.7834713 1.6811839 -0.5297785 -1.9584678 1.348143 2.1817663 Good
-1.3485153 3.2322812 4.0118165 -0.8906063 -0.0319940 2.395917 1.0428779 Good
-2.0122256 1.9280338 0.6987464 -0.9597719 -1.3497207 1.311802 1.0487620 Good
0.0530348 1.3099926 -0.2641394 -2.9692972 0.3039835 3.889359 1.9313319 Good

Correlation Plot
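
A correlation plot of banana_copy (with “Quality” coded 0/1 as described above) can be produced in several ways; a minimal sketch, assuming the corrplot package, is:

library(corrplot)
## Correlation matrix of all eight numeric columns, displayed as numbers.
corrplot(cor(banana_copy), method = "number")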

Establishing a Train and Test Set

In the code below I divide the dataset into an approximately 70% training set and a 30% testing set. I use the same random train/test indicator to create training and testing sets for both versions of the dataset: the one with ‘Quality’ as a categorical variable and the one with ‘Quality’ as a binary numeric variable.

set.seed(888)
train <- sample(c(TRUE, FALSE), nrow(banana_quality),replace=TRUE,
                prob=c(0.7,0.3))
banana_train <- banana_quality[train, ]
dim(banana_train)
## [1] 5564    8
5564/8000
## [1] 0.6955
banana_test <- banana_quality[!train, ]
dim(banana_test)
## [1] 2436    8
2436/8000
## [1] 0.3045
## Duplicate version with binomial numeric variable for Quality. 
banana_copy.train <- banana_copy[train, ]
banana_copy.test <- banana_copy[!train, ]
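
As a quick sanity check, we can confirm that the split preserves the roughly 50/50 class balance (a minimal sketch; output not shown):

## Class proportions in the training and testing sets.
prop.table(table(banana_train$Quality))
prop.table(table(banana_test$Quality))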

Classical Approaches and Analysis

We begin by testing more classical approaches to categorization, namely Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes, and K-Nearest Neighbors Classifications.

Logistic Regression

From the correlation plot shown above, it appears Acidity and Softness are the two predictor variables with the least effect on banana quality. I create a logistic regression using all but these two variables. I also create a logistic regression using all 7 variables so as to compare differences between using 5 and 7 predictor variables.

lg.fit.banana <- glm(Quality ~ Weight+Sweetness+Ripeness+HarvestTime+Size, data = banana_copy.train,
                  family = binomial)
summary(lg.fit.banana)
## 
## Call:
## glm(formula = Quality ~ Weight + Sweetness + Ripeness + HarvestTime + 
##     Size, family = binomial, data = banana_copy.train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.67913    0.07472  -22.47   <2e-16 ***
## Weight      -0.92760    0.03449  -26.90   <2e-16 ***
## Sweetness   -0.78023    0.03287  -23.73   <2e-16 ***
## Ripeness    -0.62041    0.02790  -22.24   <2e-16 ***
## HarvestTime -0.56773    0.03057  -18.57   <2e-16 ***
## Size        -0.68453    0.03002  -22.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7713.3  on 5563  degrees of freedom
## Residual deviance: 3307.8  on 5558  degrees of freedom
## AIC: 3319.8
## 
## Number of Fisher Scoring iterations: 6
lg.fit.banana.full<- glm(Quality ~ Acidity+Weight+Sweetness+Ripeness+HarvestTime+Size+Softness, data = banana_copy.train,
                  family = binomial)
summary(lg.fit.banana.full)
## 
## Call:
## glm(formula = Quality ~ Acidity + Weight + Sweetness + Ripeness + 
##     HarvestTime + Size + Softness, family = binomial, data = banana_copy.train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.72570    0.07639 -22.590  < 2e-16 ***
## Acidity      0.12326    0.02340   5.267 1.39e-07 ***
## Weight      -1.00011    0.03711 -26.948  < 2e-16 ***
## Sweetness   -0.78991    0.03338 -23.666  < 2e-16 ***
## Ripeness    -0.59410    0.03011 -19.732  < 2e-16 ***
## HarvestTime -0.56608    0.03181 -17.795  < 2e-16 ***
## Size        -0.67326    0.03030 -22.222  < 2e-16 ***
## Softness    -0.07982    0.02337  -3.415 0.000638 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7713.3  on 5563  degrees of freedom
## Residual deviance: 3257.9  on 5556  degrees of freedom
## AIC: 3273.9
## 
## Number of Fisher Scoring iterations: 6

We compute the accuracy of both logistic regression models on the test dataset.

## Prediction with 5 predictors
lg.probs.banana <- predict(lg.fit.banana, banana_copy.test)
summary(lg.probs.banana)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -10.11060  -2.76326  -0.07732   0.03889   2.80094  13.26194
lg.pred <- rep("0", dim(banana_copy.test)[1])
lg.pred[1:10]
##  [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
lg.pred[lg.probs.banana > 0.5] <- "1"
lg.pred[1:10]
##  [1] "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
length(lg.pred)
## [1] 2436
table(lg.pred, banana_copy.test$Quality)
##        
## lg.pred    0    1
##       0 1142  203
##       1   87 1004
result <- table(predict=lg.pred,truth=banana_copy.test$Quality)
result
##        truth
## predict    0    1
##       0 1142  203
##       1   87 1004
sum(diag(result))/sum(result)
## [1] 0.8809524
## Prediction with all 7 predictors
lg.probs.ban.full <- predict(lg.fit.banana, banana_copy.test)
summary(lg.probs.ban.full)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -10.11060  -2.76326  -0.07732   0.03889   2.80094  13.26194
lg.pred.full <- rep("0", dim(banana_copy.test)[1])
lg.pred.full[1:10]
##  [1] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
lg.pred.full[lg.probs.ban.full > 0.5] <- "1"
lg.pred.full[1:10]
##  [1] "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
length(lg.pred.full)
## [1] 2436
table(lg.pred.full, banana_copy.test$Quality)
##             
## lg.pred.full    0    1
##            0 1142  203
##            1   87 1004
result.full <- table(predict=lg.pred.full,truth=banana_copy.test$Quality)
result.full
##        truth
## predict    0    1
##       0 1142  203
##       1   87 1004
sum(diag(result.full))/sum(result.full)
## [1] 0.8809524

Analysis of Model Performance: Logistic Regression

The two logistic regression fits attained precisely the same test predictions here; note, however, that the second predict() call above was applied to lg.fit.banana rather than lg.fit.banana.full, so identical tables are expected. Even so, the full model’s residual deviance and AIC are only marginally lower than those of the 5-predictor model, and its estimated coefficients for Acidity and Softness are small, so the extra two variables appear to add little, consistent with the correlation values for the binary numeric response shown above. The 5-predictor model attains an 88.1% prediction accuracy on the test set. Note also that predict() without type = "response" returns values on the log-odds scale, so the 0.5 cutoff used above corresponds to a probability slightly above one half. In subsequent models, we use the 5 predictors indicated above, omitting Acidity and Softness.
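
For reference, a minimal sketch of the comparison rerun with the full 7-predictor fit on the probability scale (output not shown; the resulting accuracy may differ slightly from the tables above):

lg.probs.full.resp <- predict(lg.fit.banana.full, banana_copy.test, type = "response")
lg.pred.full.resp <- ifelse(lg.probs.full.resp > 0.5, 1, 0)  ## cutoff now on the probability scale
mean(lg.pred.full.resp == banana_copy.test$Quality)          ## test accuracy of the full model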

Linear Discriminant Analysis

Next we perform Linear Discriminant Analysis using the five predictor variables indicated above.

lda.fit <- lda(Quality ~ Weight+Sweetness+Ripeness+HarvestTime+Size, data = banana_copy.train)
lda.fit
## Call:
## lda(Quality ~ Weight + Sweetness + Ripeness + HarvestTime + Size, 
##     data = banana_copy.train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4991014 0.5008986 
## 
## Group means:
##         Weight   Sweetness   Ripeness HarvestTime         Size
## 0  0.007125373 -0.04588695 1.51509463  -0.0351919 -0.007334507
## 1 -1.540834275 -1.50806260 0.06108851  -1.4847003 -1.495996198
## 
## Coefficients of linear discriminants:
##                    LD1
## Weight      -0.3539055
## Sweetness   -0.3311260
## Ripeness    -0.2536426
## HarvestTime -0.2322707
## Size        -0.3346565
lda.pred <- predict(lda.fit,banana_copy.test)
names(lda.pred)
## [1] "class"     "posterior" "x"
lda.quality <- lda.pred$class
bctest_qual=banana_copy.test$Quality
table(lda.quality, bctest_qual)
##            bctest_qual
## lda.quality    0    1
##           0 1100  148
##           1  129 1059
mean(lda.quality == bctest_qual)
## [1] 0.886289

Analysis of Model Performance: Linear Discriminant Analysis

The Linear Discriminant Analysis with 5 predictors yields an accuracy rate of 88.6%. This is a slight, though not substantial, improvement over the accuracy rate of the Logistic Regression model.
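
The separation of the two classes along the single linear discriminant can also be visualized with the plot method from MASS (a sketch; output not shown):

## Histograms and density estimates of LD1, by class.
plot(lda.fit, dimen = 1, type = "both")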

Quadratic Discriminant Analysis

Next we perform Quadratic Discriminant Analysis using the five predictor variables indicated above.

qda.fit <- qda(Quality ~ Weight+Sweetness+Ripeness+HarvestTime+Size, data = banana_copy.train)
qda.fit
## Call:
## qda(Quality ~ Weight + Sweetness + Ripeness + HarvestTime + Size, 
##     data = banana_copy.train)
## 
## Prior probabilities of groups:
##         0         1 
## 0.4991014 0.5008986 
## 
## Group means:
##         Weight   Sweetness   Ripeness HarvestTime         Size
## 0  0.007125373 -0.04588695 1.51509463  -0.0351919 -0.007334507
## 1 -1.540834275 -1.50806260 0.06108851  -1.4847003 -1.495996198
qda.pred <- predict(lda.fit,banana_copy.test)
names(qda.pred)
## [1] "class"     "posterior" "x"
qda.quality <- qda.pred$class
table(qda.quality, bctest_qual)
##            bctest_qual
## qda.quality    0    1
##           0 1100  148
##           1  129 1059
mean(qda.quality == bctest_qual)
## [1] 0.886289

Analysis of Model Performance: Quadratic Discriminant Analysis

The Quadratic Discriminant Analysis reported here yields an accuracy rate of 88.6%, precisely the same as Linear Discriminant Analysis. Note, however, that the predict() call above was applied to lda.fit rather than qda.fit (the “x” component in the output is specific to LDA predictions), so identical results are expected; to judge whether a quadratic boundary actually helps, the prediction should be rerun with the QDA fit itself, as sketched below.
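
A minimal sketch of the corrected prediction using the QDA fit (output not shown; the accuracy may differ from the LDA figure above):

qda.pred.fix <- predict(qda.fit, banana_copy.test)  ## predict with the QDA model
table(qda.pred.fix$class, bctest_qual)              ## confusion matrix
mean(qda.pred.fix$class == bctest_qual)             ## test accuracy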

Naive Bayes

Next we build a Naive Bayes Model for the dataset and analyze the accuracy rate of predictions.

nb.fit <- naiveBayes(Quality ~ Weight+Sweetness+Ripeness+HarvestTime+Size, data = banana_copy.train)
nb.fit
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4991014 0.5008986 
## 
## Conditional probabilities:
##    Weight
## Y           [,1]     [,2]
##   0  0.007125373 1.950351
##   1 -1.540834275 1.719219
## 
##    Sweetness
## Y          [,1]     [,2]
##   0 -0.04588695 2.188256
##   1 -1.50806260 1.324101
## 
##    Ripeness
## Y         [,1]     [,2]
##   0 1.51509463 1.663352
##   1 0.06108851 2.261098
## 
##    HarvestTime
## Y         [,1]     [,2]
##   0 -0.0351919 1.971028
##   1 -1.4847003 1.701550
## 
##    Size
## Y           [,1]     [,2]
##   0 -0.007334507 2.181861
##   1 -1.495996198 1.784302
nb.quality <- predict(nb.fit , banana_copy.test)
table(nb.quality, bctest_qual)
##           bctest_qual
## nb.quality    0    1
##          0 1114  137
##          1  115 1070
mean(nb.quality == bctest_qual)
## [1] 0.8965517

Analysis of Model Performance: Naive Bayes

The Naive Bayes Model based on 5 predictors yields an accuracy rate of 89.7%. This accuracy rate is an improvement on the Logistic Regression Model as well as both the Linear Discriminant Analysis and Quadratic Discriminant Analysis Models, by more than a full percentage point. This is the most accurate model that we have tested thus far.

K-Nearest Neighbors (KNN)

Next we apply K-Nearest Neighbors, using a loop over values of K to find the value that gives the best prediction on the test set. We report the K-value and accuracy of the best prediction by this method.

attach(banana_copy)
train.X=cbind(Weight,Sweetness,Ripeness,HarvestTime,Size)[train, ]
test.X=cbind(Weight,Sweetness,Ripeness,HarvestTime,Size)[!train, ]
Train.Class=banana_copy.train$Quality
## We create an empty vector to store accuracy rates of prediction by k-value.
mean_vector <- c()
## For-loop, K nearest neighbor predictions, k=1 to 50, with accuracy rates
for (n in 1:50){
  knn.pred=knn(train.X,test.X,Train.Class,k=n)
  mean_k <- mean(knn.pred == bctest_qual)
  mean_vector <- c(mean_vector,mean_k)
  cat('\n')
}
which.max(mean_vector) ## k value yielding maximum accuracy rate.
## [1] 34
max(mean_vector) ##  associated maximum accuracy rate for this k-value.
## [1] 0.933087

Analysis of Model Performance: K-Nearest Neighbors

K-Nearest Neighbors attains its maximum accuracy rate at K=34 neighbors: 93.3% on the test set. This is an improvement of roughly 3.7 percentage points over the best previous model, Naive Bayes, and makes K-Nearest Neighbors the best of the five classical classification models tested. A plot of accuracy against K is sketched below.
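
To visualize the sweep over K, the stored accuracy rates can be plotted (a minimal sketch in the style of the other plots in this report):

plot(1:50, mean_vector, type = "b", xlab = "K", ylab = "Test Accuracy",
     main = "KNN Accuracy by Number of Neighbors", col = "grey26", col.main = "navy")
points(which.max(mean_vector), max(mean_vector), col = "navy", cex = 2, pch = 20)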

Modern Approaches and Analysis

We now apply modern approaches to categorization, namely Classification Trees, Bagging Classification Trees, Random Forest, and Support Vector Machine Classification.

Classification Tree

We will now build a classification tree for the banana_quality dataset.

banana.tree=tree(Quality~.,banana_train)
summary(banana.tree)
## 
## Classification tree:
## tree(formula = Quality ~ ., data = banana_train)
## Number of terminal nodes:  17 
## Residual mean deviance:  0.5085 = 2821 / 5547 
## Misclassification error rate: 0.08735 = 486 / 5564
plot(banana.tree,col="grey26")
title("Classification Tree: Banana Quality",col.main="navy")
text(banana.tree, pretty=0,cex=0.5,font=2)

banana.tree

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 5564 7713.00 Bad ( 0.50090 0.49910 )
##     2) Sweetness < 0.283554 4170 5559.00 Bad ( 0.61487 0.38513 )
##       4) HarvestTime < 0.576851 2980 3263.00 Bad ( 0.76309 0.23691 )
##         8) Ripeness < 0.258735 1287 508.90 Bad ( 0.95027 0.04973 )
##          16) Size < 1.20259 1170 271.70 Bad ( 0.97521 0.02479 ) *
##          17) Size > 1.20259 117 142.80 Bad ( 0.70085 0.29915 ) *
##         9) Ripeness > 0.258735 1693 2247.00 Bad ( 0.62079 0.37921 )
##          18) Weight < 0.210609 1300 1492.00 Bad ( 0.73923 0.26077 )
##            36) Size < 0.216993 963 712.40 Bad ( 0.87850 0.12150 )
##              72) Softness < 2.7257 929 577.70 Bad ( 0.90635 0.09365 ) *
##              73) Softness > 2.7257 34 24.63 Good ( 0.11765 0.88235 ) *
##            37) Size > 0.216993 337 432.60 Good ( 0.34125 0.65875 )
##              74) Softness < -0.442309 120 108.10 Bad ( 0.83333 0.16667 ) *
##              75) Softness > -0.442309 217 109.10 Good ( 0.06912 0.93088 ) *
##          19) Weight > 0.210609 393 422.90 Good ( 0.22901 0.77099 )
##            38) Softness < -0.135802 313 206.60 Good ( 0.10224 0.89776 )
##              76) Acidity < -1.05168 51 69.74 Bad ( 0.56863 0.43137 ) *
##              77) Acidity > -1.05168 262 32.78 Good ( 0.01145 0.98855 ) *
##            39) Softness > -0.135802 80 94.11 Bad ( 0.72500 0.27500 ) *
##       5) HarvestTime > 0.576851 1190 1322.00 Good ( 0.24370 0.75630 )
##        10) Softness < -1.79531 137 153.50 Bad ( 0.75182 0.24818 ) *
##        11) Softness > -1.79531 1053 985.00 Good ( 0.17759 0.82241 )
##          22) Ripeness < -1.27999 178 227.50 Bad ( 0.66292 0.33708 )
##            44) Size < 0.731693 104 56.41 Bad ( 0.92308 0.07692 ) *
##            45) Size > 0.731693 74 90.07 Good ( 0.29730 0.70270 ) *
##          23) Ripeness > -1.27999 875 482.90 Good ( 0.07886 0.92114 ) *
##     3) Sweetness > 0.283554 1394 1226.00 Good ( 0.15997 0.84003 )
##       6) Softness < 1.15717 1114 595.80 Good ( 0.07540 0.92460 )
##        12) Weight < -0.189172 159 214.30 Good ( 0.40252 0.59748 ) *
##        13) Weight > -0.189172 955 194.20 Good ( 0.02094 0.97906 ) *
##       7) Softness > 1.15717 280 388.10 Good ( 0.49643 0.50357 )
##        14) Size < -1.60508 143 108.20 Bad ( 0.87413 0.12587 ) *
##        15) Size > -1.60508 137 90.38 Good ( 0.10219 0.89781 ) *
tree.pred <- predict( banana.tree , banana_test ,type="class")
head(tree.pred)
## [1] Good Good Bad  Good Good Good
## Levels: Bad Good
table.tp <- table(tree.pred,banana_test$Quality)
table.tp
##          
## tree.pred  Bad Good
##      Bad  1093  127
##      Good  114 1102
sum(diag(table.tp))/sum(table.tp)
## [1] 0.9010673
cv.tree <- cv.tree(banana.tree,FUN=prune.misclass)
cv.tree

## $size
## [1] 17 15 14 13 12 11  9  8  7  6  3  2  1
## 
## $dev
## [1]  573  573  567  619  624  676  749  830  910  993 1169 1876 2869
## 
## $k
## [1]     -Inf   0.0000   7.0000  26.0000  30.0000  36.0000  53.5000  58.0000
## [9]  69.0000  80.0000 106.6667 610.0000 948.0000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

par(mfrow=c(1,1))
plot(cv.tree$size,cv.tree$dev,type="b", main="Error Rate by Tree Size",
     col.main="navy")
points(cv.tree$size[1],cv.tree$dev[1],col="navy",cex=2,pch=20)

Analysis of Model Performance: Classification Tree

We created a classification tree and used it to predict the classes of banana quality. The full tree with 17 terminal nodes yielded a 90.1% test accuracy. Cross-validation in fact favors a slightly smaller tree: the minimum CV misclassification count (567) occurs at 14 terminal nodes, compared with 573 at 17 nodes, though the difference is marginal. This level of accuracy surpassed every classical method that we employed other than K-Nearest Neighbors (K=34) on the testing set. A sketch of pruning to the CV-favored size follows.
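
A minimal sketch of pruning to the CV-favored size and re-checking test accuracy (output not shown):

prune.banana <- prune.misclass(banana.tree, best = 14)           ## prune to 14 terminal nodes
prune.pred <- predict(prune.banana, banana_test, type = "class")
mean(prune.pred == banana_test$Quality)                          ## test accuracy of the pruned tree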

Bagging Classification Tree

Next we use the randomForest library to build a Bagging Classification Tree. We then form a prediction and find the corresponding accuracy rate.

bag.train <- randomForest(Quality ~.,data = banana_train,mtry = ncol(banana_train)-1,importance=T)
bag.train
## 
## Call:
##  randomForest(formula = Quality ~ ., data = banana_train, mtry = ncol(banana_train) -      1, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 3.38%
## Confusion matrix:
##       Bad Good class.error
## Bad  2702   85  0.03049874
## Good  103 2674  0.03709039
bag.train.pred <- predict(bag.train,banana_test,type="class")
head(bag.train.pred)
##    5   10   15   16   17   19 
## Good Good Good Good Good Good 
## Levels: Bad Good
summary(bag.train.pred)
##  Bad Good 
## 1210 1226
table.bag <- table(bag.train.pred,banana_test$Quality)
table.bag
##               
## bag.train.pred  Bad Good
##           Bad  1164   46
##           Good   43 1183
## Testing accuracy:
sum(diag(table.bag))/sum(table.bag)
## [1] 0.9634647

Analysis of Model Performance: Bagging Classification Tree

Applying a Bagging Classification Tree to the dataset, we were able to attain a prediction accuracy of 96.3%. This is our best model so far, exceeding all classical and modern classification models used previously. The performance of this model exceeds our stated goal of finding a model with 95% accuracy or more.
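
Since importance = T was set, the bagged model’s variable importance measures can also be inspected (a sketch; output not shown):

importance(bag.train)                                            ## importance measures per predictor
varImpPlot(bag.train, main = "Variable Importance: Bagged Classification Tree")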

Random Forest

We next apply a loop with Random Forests to try to attain the maximum prediction accuracy possible.

num_preds <- seq(1,7,by=1)
accuracy <- c()
for(i in num_preds) {
  rf.train <- randomForest(Quality~.,data=banana_train,mtry=i,importance=T)
  rf.train.pred <- predict(rf.train,banana_test,type="class")
  table.rf <- table(rf.train.pred,banana_test$Quality)
  accuracy <- c(accuracy,sum(diag(table.rf))/sum(table.rf))
}
plot(num_preds,accuracy,type="b",main="Accuracy by Number of Predictors",
     xlab="Number of Predictors: m",ylab="Accuracy",col="grey26",
     col.main="navy")
which.max(accuracy)
## [1] 1
best_m <- which.max(accuracy)
best_m
## [1] 1
accuracy[1]
## [1] 0.9683908
points(num_preds[1],accuracy[1],col="navy",cex=2,pch=20)

Analysis of Model Performance: Random Forest Classification

Applying Random Forest classification to the dataset, we attained a prediction accuracy of 96.8% using only one predictor considered at each node split. This surpasses the Bagging classification above as our most accurate prediction on the test set thus far. Note that the plot shown above may be slightly misleading: every number of predictors m yielded an accuracy between 96% and 97%, so the apparent decrease from left to right corresponds to only a small drop in overall accuracy.
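
Because randomForest is stochastic, exact accuracies vary slightly from run to run; a minimal sketch of a seeded refit at the best value of m, for later inspection:

set.seed(888)
rf.best <- randomForest(Quality ~ ., data = banana_train, mtry = best_m, importance = T)
rf.best  ## OOB error estimate and confusion matrix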

Support Vector Machine Classification

The final modern classification method we apply to the dataset is classification of banana quality via Support Vector Machines. I tune the model to find the best fit using first a linear kernel, and subsequently a radial kernel.

tune.out.lin <- tune(svm,Quality ~.,data=banana_train,kernel="linear",
                 ranges=list(cost=c(0.001,0.01,0.1,1,2,3,4,5,6,7,8,9,10,100)))
summary(tune.out.lin)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.1263478 
## 
## - Detailed performance results:
##     cost     error dispersion
## 1  1e-03 0.1310221 0.01091941
## 2  1e-02 0.1272464 0.01209270
## 3  1e-01 0.1272455 0.01240378
## 4  1e+00 0.1263478 0.01290333
## 5  2e+00 0.1267069 0.01312482
## 6  3e+00 0.1267069 0.01312482
## 7  4e+00 0.1265270 0.01333938
## 8  5e+00 0.1265270 0.01333938
## 9  6e+00 0.1265270 0.01333938
## 10 7e+00 0.1265270 0.01333938
## 11 8e+00 0.1265270 0.01333938
## 12 9e+00 0.1265270 0.01333938
## 13 1e+01 0.1267069 0.01312482
## 14 1e+02 0.1270656 0.01258293
bestmod.lin <- tune.out.lin$best.model
summary(bestmod.lin)
## 
## Call:
## best.tune(METHOD = svm, train.x = Quality ~ ., data = banana_train, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 2, 3, 4, 5, 6, 7, 
##         8, 9, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  1718
## 
##  ( 859 859 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Bad Good
## Prediction with SVM-linear Model
qual_pred <- predict(bestmod.lin,banana_test)
result_1 <- table(predict=qual_pred,truth=banana_test$Quality)
result_1
##        truth
## predict  Bad Good
##    Bad  1059  125
##    Good  148 1104
sum(diag(result_1))/sum(result_1)
## [1] 0.887931
tune.out_rad <- tune(svm,Quality~.,data=banana_train,kernel="radial",
                 ranges=list(
                   cost=c(.001,.01,.1,1,10,100),
                   gamma=c(.01,.1,.25,.5,.75,1)
                   )
                 )
summary(tune.out_rad)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1  0.25
## 
## - best performance: 0.01599525 
## 
## - Detailed performance results:
##     cost gamma      error  dispersion
## 1  1e-03  0.01 0.50736893 0.010913198
## 2  1e-02  0.01 0.12671009 0.013134808
## 3  1e-01  0.01 0.10765793 0.012707938
## 4  1e+00  0.01 0.06901502 0.009361660
## 5  1e+01  0.01 0.03827997 0.006496021
## 6  1e+02  0.01 0.02731650 0.006486113
## 7  1e-03  0.10 0.46872732 0.022222530
## 8  1e-02  0.10 0.05301913 0.004881469
## 9  1e-01  0.10 0.02875470 0.005982885
## 10 1e+00  0.10 0.01905054 0.004574860
## 11 1e+01  0.10 0.01743377 0.002818109
## 12 1e+02  0.10 0.01923104 0.003602767
## 13 1e-03  0.25 0.50736893 0.010913198
## 14 1e-02  0.25 0.03217100 0.005321343
## 15 1e-01  0.25 0.01994950 0.005895072
## 16 1e+00  0.25 0.01599525 0.002735259
## 17 1e+01  0.25 0.01707341 0.003713369
## 18 1e+02  0.25 0.02174580 0.006955020
## 19 1e-03  0.50 0.50736893 0.010913198
## 20 1e-02  0.50 0.02552116 0.005345576
## 21 1e-01  0.50 0.01797205 0.004058880
## 22 1e+00  0.50 0.01671338 0.004400046
## 23 1e+01  0.50 0.01725295 0.004495918
## 24 1e+02  0.50 0.02678080 0.005594615
## 25 1e-03  0.75 0.50736893 0.010913198
## 26 1e-02  0.75 0.03198953 0.007608548
## 27 1e-01  0.75 0.01743313 0.003975160
## 28 1e+00  0.75 0.01617510 0.003167518
## 29 1e+01  0.75 0.01941090 0.005672607
## 30 1e+02  0.75 0.02785671 0.006625851
## 31 1e-03  1.00 0.50736893 0.010913198
## 32 1e-02  1.00 0.07331349 0.029236675
## 33 1e-01  1.00 0.01815191 0.004270961
## 34 1e+00  1.00 0.01635528 0.002990260
## 35 1e+01  1.00 0.02013000 0.006443375
## 36 1e+02  1.00 0.02552084 0.006660753
bestfullmod <- tune.out_rad$best.model
summary(bestfullmod)
## 
## Call:
## best.tune(METHOD = svm, train.x = Quality ~ ., data = banana_train, 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10, 100), gamma = c(0.01, 
##         0.1, 0.25, 0.5, 0.75, 1)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  604
## 
##  ( 298 306 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Bad Good
## Prediction with SVM-Radial Model
class_pred <- predict(bestfullmod,banana_test)
result <- table(predict=class_pred,truth=banana_test$Quality)
result
##        truth
## predict  Bad Good
##    Bad  1185   26
##    Good   22 1203
sum(diag(result))/sum(result)
## [1] 0.9802956

Analysis of Model Performance: Support Vector Machine Classification

After tuning, the Support Vector Machine classification with a linear kernel had a best cost of 1 and used 1718 support vectors. This model yielded an 88.8% accuracy rate; better than each classical model other than KNN, but far below the accuracy rates of the Classification Tree, Bagging Classification Tree, and Random Forest. On the other hand, after tuning, the Support Vector Machine classification with a radial kernel had a cost of 1 and a gamma value of 0.25, and used 604 support vectors to yield a prediction with 98.0% accuracy. This is the strongest accuracy of any method tested, across both classical and modern approaches.
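
The cross-validation error over the cost/gamma grid can also be visualized with the plot method for tune objects from e1071 (a sketch; darker regions correspond to lower error):

plot(tune.out_rad)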

Conclusion:

We set out to apply both classical and modern classification methods to predict banana quality for the banana_quality dataset, with a stated goal of achieving 95% accuracy. We were able to exceed that goal, attaining 98% accuracy, by employing an SVM classification with a radial kernel. In a real-world context, being able to classify banana quality with 98% accuracy based on measured characteristics would be a powerful result.