We are interested in classifying wines into three vintage classes, A, B and C. The methods we use are cross-validated LDA/QDA, K-Nearest Neighbors and Naive Bayes, all of which share the leave-one-out cross-validation (LOOCV) approach.
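All of the classifiers below are assessed with the same leave-one-out idea: each observation is held out once, the model is fitted on the remaining observations, and the held-out observation is then predicted. A minimal sketch of that pattern is shown below; fit_fun and predict_fun are hypothetical placeholders rather than functions used later in the document (the later sections use lda(CV = TRUE), knn.cv() and manual naive_bayes() loops instead).
loocv_error <- function(data, fit_fun, predict_fun) {
  n <- nrow(data)
  pred <- character(n)
  for (i in 1:n) {
    fit <- fit_fun(data[-i, ]) # train on the n - 1 remaining observations
    pred[i] <- as.character(predict_fun(fit, data[i, , drop = FALSE])) # predict the held-out row
  }
  mean(pred != as.character(data$vintage)) # proportion misclassified
}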
library(MASS)
wine <- read.csv("wine.csv")
numeric_pred <- c("ash", "ash_alkalinity", "magnesium", "nonflav_phenols",
"proanth") #Select numeric predictors
predictors <- wine[, numeric_pred] # Numerical predictors
class <- wine[, "vintage"] # Response class
pred_and_class <- wine[, c(numeric_pred, "vintage")]
lda_wine <- lda(vintage ~., data = pred_and_class, CV = TRUE) # TRUE = LOOCV
pred_class <- lda_wine$class #Select predicted response
cm <- table(pred_class, class)
cm # Confusion matrix
## class
## pred_class A B C
## A 33 1 12
## B 1 53 7
## C 14 4 52
accuracy <- sum(diag(cm))/sum(cm) #True classifications over all classifications
accuracy
## [1] 0.779661
error <- 1 - accuracy #P(Error) = 1 - P(true)
error
## [1] 0.220339
The results of the Linear Discriminant Analysis show that vintage is classified correctly with an accuracy of roughly 77.97%, and an error from the cross-validated LDA of approximately 22.03%.
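For reference, the same accuracy can be recovered by hand from the confusion matrix above, since the diagonal counts the correctly classified wines:
(33 + 53 + 52) / 177 # 138 correct classifications out of 177 wines, roughly 0.7797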
qda_wine <- qda(vintage ~., data = pred_and_class, CV = TRUE) #QDA LOOCV
pred_class <- qda_wine$class
cmq <- table(pred_class, class)
cmq
## class
## pred_class A B C
## A 38 0 15
## B 0 51 7
## C 10 7 49
accuracyq <- (cmq[1, 1] + cmq[2, 2] + cmq[3, 3])/sum(cmq)
accuracyq
## [1] 0.779661
errorq <- 1 - accuracyq
errorq
## [1] 0.220339
#Approach is the same as in q1)
Results from the cross-validated Quadratic Discriminant Analysis show exactly the same overall accuracy and error as the LDA. The two methods differ, however, in which individual observations they misclassify: the proportions of A, B and C vintage wines assigned to each predicted class differ between LDA and QDA, so the per-class precision and recall of the two analyses may vary.
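As a rough check of that claim, per-class precision and recall can be read off each confusion matrix. The helper below is a small sketch using the cm and cmq objects created above; rows of each table are predicted classes and columns are the true classes.
class_metrics <- function(conf) {
  precision <- diag(conf) / rowSums(conf) # of wines predicted as each class, the share truly in it
  recall <- diag(conf) / colSums(conf) # of wines truly in each class, the share recovered
  rbind(precision, recall)
}
class_metrics(cm) # LDA
class_metrics(cmq) # QDA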
library(class)
error <- numeric(15) # Pre-allocate the error vector for k = 1, ..., 15
for (k in 1:15) {
  pred_class <- knn.cv(train = predictors, cl = class, k = k, prob = TRUE) #LOOCV KNN
  cm_k <- table(pred_class, class)
  accuracy_k <- sum(diag(cm_k))/sum(cm_k)
  error[k] <- 1 - accuracy_k #Save the error rate for this value of k
}
kvals <- 1:15 # K values vector
cm_k
## class
## pred_class A B C
## A 23 1 13
## B 13 50 18
## C 12 7 40
error
## [1] 0.3502825 0.4124294 0.3559322 0.3107345 0.3276836 0.3220339 0.3107345
## [8] 0.3220339 0.3333333 0.3559322 0.3728814 0.3446328 0.3446328 0.3502825
## [15] 0.3615819
plot(error~kvals, xlab = "K Values",
main = "Scatterplot of K-values to KNN Error") #Plot Error to k-values
lines(predict(loess(error~kvals)), col ="slateblue3") #fit loess smoothing line
Figure 1: Scatterplot of k values against the K-NN test error.
From Figure 1, an appropriate value for k is 7. On the line of best fit, k = 7 is the global minimum, meaning the cross-validated KNN makes the fewest errors when classifying the different levels of vintage wine. The LDA and QDA methods performed better in this instance: their test errors relative to the KNN test error are LDA = 22%, QDA = 22% and KNN (k = 7) ≈ 31%, indicating KNN did not perform better than the previous methods.
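Rather than reading the minimum off the smoothed curve, the lowest raw LOOCV error for this particular run can be pulled directly from the error vector. Note that which.min() returns the first k attaining the minimum, which may differ from the minimum of the loess fit shown in the figure.
which.min(error) # first k with the lowest raw LOOCV error in this run
min(error) # the corresponding error rate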
Note: this method changes my values for k every time it knits; my estimate is k = {6, 7, 8}. I'm not sure how to get a consistent answer.
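A likely source of that variation is that knn.cv() breaks ties between equally distant neighbours (and tied votes) at random, so each knit can produce slightly different predictions. Fixing the random seed before each knn.cv loop should make the knitted results reproducible; the seed value used here is arbitrary.
set.seed(1) # arbitrary seed; makes the random tie-breaking in knn.cv() repeatable across knits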
for (k in 1:15) {
  pred_class <- knn.cv(train = scale(predictors), cl = class, k = k,
                       prob = TRUE) #LOOCV KNN with scaled predictors
  cm_k <- table(pred_class, class)
  accuracy_k <- sum(diag(cm_k))/sum(cm_k)
  error[k] <- 1 - accuracy_k #Overwrite error with the scaled-predictor error rates
}
cm_k
## class
## pred_class A B C
## A 43 2 16
## B 1 54 13
## C 4 2 42
error
## [1] 0.2542373 0.2259887 0.2146893 0.2146893 0.1920904 0.1920904 0.2033898
## [8] 0.1864407 0.1977401 0.2090395 0.1977401 0.1977401 0.1920904 0.1977401
## [15] 0.2146893
plot(error~kvals, xlab = "K Values",
main = "Scatterplot of K-values to KNN Error (Scaled Predictors)")
lines(predict(loess(error~kvals)), col ="darkred") #fit loess smoothing line
Figure 2: KNN error against k-value, using scaled predictor variables.
From Figure 2, the results show a substantial decrease in error compared with the KNN test in q3), out-performing the previous LDA and QDA classification methods thanks to the lower error rate. An appropriate value of k in this test would be 6, the global minimum of the error line of best fit. Scaling the numeric predictors to mean 0 and standard deviation 1 alters the Euclidean distances that the neighborhood uses to classify observations: previously the predictor “magnesium” had much larger measurements, which dominated the predictors’ dimensional space. The scaling therefore results in more accurate classification of vintage wines.
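To see why scaling matters here, compare the spread of the raw predictors: magnesium varies over a much wider range than the other four variables, so it dominates the Euclidean distances before scaling. A quick check using the predictors data frame defined above:
apply(predictors, 2, sd) # raw standard deviations; magnesium's is by far the largest
apply(scale(predictors), 2, sd) # after scaling, every predictor has standard deviation 1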
Note: this method changes my values for k every time it knits; my estimate is k = {6, 7, 8}. I'm not sure how to get a consistent answer.
library(naivebayes)
## naivebayes 0.9.7 loaded
obs <- dim(wine)[1] # Total number of observations
pred_class <- character(obs) #Pre-allocate the vector of predicted classes
for (test.index in 1:obs) { #Holds out one observation per loop
  #Trains with the other 176 observations
  wine_nb <- naive_bayes(vintage ~ ., data = pred_and_class[-test.index, ])
  #Predicts the observation held out by the loop as a class label
  pred_test <- predict(wine_nb, newdata = pred_and_class[test.index, ],
                       type = "class")
  #Stores the predicted observation as a character A, B or C
  pred_class[test.index] <- as.character(pred_test)
}
confuse <- table(pred_class, class)
confuse
## class
## pred_class A B C
## A 37 0 21
## B 2 53 7
## C 9 5 43
accuracy <- sum(diag(confuse))/sum(confuse)
error <- 1 - accuracy
accuracy
## [1] 0.7514124
error
## [1] 0.2485876
The LOOCV Naive Bayes classifier has an accuracy of approximately 75% and a classification error of roughly 25%. This classifier did not perform as well as the previous KNN test on scaled predictors (for example k = 5, error ≈ 19%), or the LDA and QDA analyses above.
# Same method of LOOCV as above
for (test.index in 1:obs) {
  new_wine_nb <- naive_bayes(vintage ~ ., data = wine[-test.index, ],
                             laplace = 1) #laplace = 1 smooths probabilities that would otherwise be 0
  new_pred_test <- predict(new_wine_nb, newdata = wine[test.index, ])
  pred_class[test.index] <- as.character(new_pred_test)
}
con_matrix <- table(pred_class, class)
con_matrix
## class
## pred_class A B C
## A 43 1 6
## B 0 50 3
## C 5 7 62
acc <- sum(diag(con_matrix))/sum(con_matrix)
error <- 1 - acc
acc
## [1] 0.8757062
error
## [1] 0.1242938
Including the categorical predictor variable colour, we see the accuracy increase from 75% to approximately 88%, with a classification error of 12%. From these results, the Naive Bayes classifier with colour included has performed better than the previous classification methods.
## 7)
#Naive Bayes classification with no LOOCV (All obs as training data)
wine_nb_all <- naive_bayes(vintage ~ ., data = wine, laplace = 1)
pred.class <- predict(wine_nb_all, newdata = wine)
cont_table <- table(pred.class, class)
acc <- sum(diag(cont_table))/sum(cont_table)
error <- 1 - acc
paste(acc, error)
## [1] "0.88135593220339 0.11864406779661"
tables(wine_nb_all)
##
## ---------------------------------------------------------------------------------
## ::: ash (Gaussian)
## ---------------------------------------------------------------------------------
##
## ash A B C
## mean 2.4370833 2.4560345 2.2447887
## sd 0.1846902 0.2291245 0.3154673
##
## ---------------------------------------------------------------------------------
## ::: ash_alkalinity (Gaussian)
## ---------------------------------------------------------------------------------
##
## ash_alkalinity A B C
## mean 21.416667 17.062069 20.238028
## sd 2.258161 2.561375 3.349770
##
## ---------------------------------------------------------------------------------
## ::: magnesium (Gaussian)
## ---------------------------------------------------------------------------------
##
## magnesium A B C
## mean 99.31250 105.98276 94.54930
## sd 10.89047 10.22465 16.75350
##
## ---------------------------------------------------------------------------------
## ::: nonflav_phenols (Gaussian)
## ---------------------------------------------------------------------------------
##
## nonflav_phenols A B C
## mean 0.44750000 0.29017241 0.36366197
## sd 0.12413959 0.07064841 0.12396128
##
## ---------------------------------------------------------------------------------
## ::: proanth (Gaussian)
## ---------------------------------------------------------------------------------
##
## proanth A B C
## mean 1.1535417 1.8925862 1.6302817
## sd 0.4088359 0.4124193 0.6020678
##
## ---------------------------------------------------------------------------------
## ::: colour (Categorical)
## ---------------------------------------------------------------------------------
##
## colour A B C
## col1 0.21153846 0.06451613 0.42666667
## col2 0.75000000 0.20967742 0.04000000
## col3 0.01923077 0.43548387 0.02666667
## col4 0.01923077 0.29032258 0.50666667
##
## ---------------------------------------------------------------------------------
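The laplace = 1 argument is visible in the colour table above: each cell count is increased by one before the conditional probabilities are formed, so no entry is exactly zero. With n_jk the number of class-k wines of colour level j, n_k the number of class-k wines, and m = 4 colour levels, this corresponds to the standard Laplace-smoothed estimate
\[
\begin{aligned}
\hat{P}(X=\text{col}_j\mid Y=k)&=\frac{n_{jk}+1}{n_k+1\times m}.
\end{aligned}\]
For example, class A contains 48 wines and, consistent with the 0.0192 entries above, none of them fall in col3, so the table reports (0 + 1)/(48 + 4) = 1/52 ≈ 0.0192 rather than zero.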
Eq.1 \[
\begin{aligned}
P(Y=A\mid X=\text{col1})&=\frac{P(X=\text{col1}\mid Y=A)}{P(X=\text{col1}\mid Y=A)+P(X=\text{col1}\mid Y=B)+P(X=\text{col1}\mid Y=C)}\\
&=\frac{0.21153846}{0.21153846+0.06451613+0.42666667}\\
&= 0.3010
\end{aligned}\] Treating the three vintage classes as equally likely a priori, there is roughly a 30% chance that a wine is classed A given that it is from colour category col1.
Eq.2\[\begin{aligned}
P(Y=B\mid X=\text{col1})&=\frac{P(X=\text{col1}\mid Y=B)}{P(X=\text{col1}\mid Y=A)+P(X=\text{col1}\mid Y=B)+P(X=\text{col1}\mid Y=C)}\\
&=\frac{0.06451613}{0.21153846+0.06451613+0.42666667}\\
&= 0.0919
\end{aligned}\] There is approximately a 9% probability that a wine from col1 will be classified as vintage level B.
Eq.3\[\begin{aligned}
P(Y=C\mid X=\text{col1})&=\frac{P(X=\text{col1}\mid Y=C)}{P(X=\text{col1}\mid Y=A)+P(X=\text{col1}\mid Y=B)+P(X=\text{col1}\mid Y=C)}\\
&=\frac{0.42666667}{0.21153846+0.06451613+0.42666667}\\
&= 0.6072
\end{aligned}\] There is a probability of approximately 61% that a col1-coloured wine is classified as vintage level C.
eq1 = 0.21153846/(0.21153846+0.06451613+0.42666667)
eq2 = 0.06451613/(0.21153846+0.06451613+0.42666667)
eq3 = 0.42666667/(0.21153846+0.06451613+0.42666667)
paste(eq1,eq2,eq3) #answers
## [1] "0.301027551094726 0.0918089912350169 0.607163457670257"
sum(eq1, eq2, eq3) #check that the conditional probabilities sum to 1
## [1] 1