Today I will demonstrate some quick examples of classification methods. The data come from the Apple Juice dataset, downloaded from http://www.stat.ufl.edu/~winner/datasets.html. They describe the absence or presence of growth of CRA7152 in apple juice as a function of pH, Brix, temperature, and nisin concentration. The data come as a .dat file with 74 rows and five columns: pH, nisin concentration, temperature, Brix, and a 0/1 indicator of growth.
The original source of the data, as I understand it, is “Modeling the Growth Limit of Alicyclobacillus Acidoterrestris CRA7152 in Apple Juice: Effect of pH, Brix, Temperature, and Nisin Concentration” by W.E.L. Pena, P.R. De Massaguer, A.D.G. Zuniga, and S.H. Saraiva in the Journal of Food Processing and Preservation, Vol. 35, pp. 509-517.
After downloading the data, we read them into R, assign names to the variables, and convert growth to a factor.
juice <- read.table('apple_juice.dat',sep='')
names(juice) <- c('pH','nisin','temp','brix','growth')
juice$growth <- as.factor(juice$growth)
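As a quick sanity check (optional, and not part of the original output), str() and summary() can confirm the import worked as expected:
str(juice)      # 74 observations of 5 variables; growth should now be a factor
summary(juice)  # ranges of pH, nisin, temp, and brix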
We will now split the dataset into a training set of 44 rows and a test set of the remaining 30. This allows us to fit each model on the training set and evaluate its predictive accuracy on the test set.
set.seed(111)
train <- sample(74,44)
juice.train <- juice[train,]
juice.test <- juice[-train,]
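Because the split is random, it can be worth a quick check (not shown in the original) of how the two growth classes are distributed across the two sets:
table(juice.train$growth)  # class counts in the 44 training rows
table(juice.test$growth)   # class counts in the 30 test rows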
First, we will fit a Logistic Regression model with the glm() function. Note that family = binomial is specified, which is what makes glm() fit a logistic rather than an ordinary linear model.
juice.glm <- glm(growth~.,data=juice.train,family=binomial)
summary(juice.glm)
##
## Call:
## glm(formula = growth ~ ., family = binomial, data = juice.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.53731 -0.23740 -0.06874 0.27213 1.69750
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.40430 3.56228 -0.394 0.69342
## pH 1.16600 0.70598 1.652 0.09861 .
## nisin -0.06829 0.02629 -2.598 0.00939 **
## temp 0.15683 0.07674 2.044 0.04099 *
## brix -0.66822 0.25940 -2.576 0.00999 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 56.464 on 43 degrees of freedom
## Residual deviance: 25.723 on 39 degrees of freedom
## AIC: 35.723
##
## Number of Fisher Scoring iterations: 7
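Since the coefficients are on the log-odds scale, one optional follow-up (not part of the original write-up) is to exponentiate them to obtain odds ratios, which are often easier to interpret:
exp(coef(juice.glm))  # e.g., each one-unit increase in nisin multiplies the odds of growth by roughly exp(-0.068), about 0.93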
Now that we have our initial model, we use it to predict the presence of CRA7152 (growth = 1 as opposed to 0) on the test data. predict() with type = 'response' returns a probability for each of the 30 test rows, and an object of predicted classes is built from these probabilities: any probability below 0.5 is counted as zero, the rest as one.
juice.glm.probs <- predict(juice.glm,newdata=juice.test,type='response')
juice.pred <- rep(1,30)
juice.pred[juice.glm.probs < .5] <- 0
We can then construct a confusion table comparing the predicted classes with the actual values from the test data, and from it calculate the proportion of correct predictions.
table1 <- table(juice.pred,juice.test$growth)
table1
##
## juice.pred 0 1
## 0 16 9
## 1 3 2
(table1[1,1] + table1[2,2])/30
## [1] 0.6
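An equivalent way to compute the same accuracy, without indexing the table by hand, is to compare the predictions directly to the test labels:
mean(juice.pred == juice.test$growth)  # proportion of test rows classified correctly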
Next, we will use Linear Discriminant Analysis (LDA) to make predictions on the Apple Juice data, via the lda() function from the MASS package.
library(MASS)
set.seed(222)
juice.lda <- lda(growth~.,data=juice.train)
juice.lda
## Call:
## lda(growth ~ ., data = juice.train)
##
## Prior probabilities of groups:
## 0 1
## 0.6590909 0.3409091
##
## Group means:
## pH nisin temp brix
## 0 4.431034 42.75862 37.31034 15.48276
## 1 4.766667 14.66667 42.20000 12.46667
##
## Coefficients of linear discriminants:
## LD1
## pH 0.56423256
## nisin -0.03145867
## temp 0.05831529
## brix -0.25538111
Predictions are again made on the test data. In this case, however, predict() returns the predicted classes directly (along with the posterior probabilities and the linear discriminant scores), so no manual thresholding is needed. The predicted classes are stored in a separate object and compared with the actual test values in another table, from which the proportion of correct predictions is calculated.
juice.lda.pred <- predict(juice.lda,newdata=juice.test)
names(juice.lda.pred)
## [1] "class" "posterior" "x"
juice.lda.class <- juice.lda.pred$class
table2 <- table(juice.lda.class,juice.test$growth)
table2
##
## juice.lda.class 0 1
## 0 15 7
## 1 4 4
(table2[1,1]+table2[2,2])/30
## [1] 0.6333333
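The posterior element of the prediction object holds the class probabilities, so the default 0.5 cutoff could be changed if, say, missing actual growth is considered costlier than a false alarm. A brief sketch (the 0.3 cutoff is purely illustrative):
post1 <- juice.lda.pred$posterior[, '1']  # posterior probability of growth for each test row
sum(post1 >= 0.5)                         # test rows predicted to show growth at the default cutoff
sum(post1 >= 0.3)                         # a lower cutoff flags more rows as at risk of growth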
We now follow a similar process with Quadratic Discriminant Analysis, using the qda() function from the MASS package. QDA estimates a separate covariance matrix for each class, so the decision boundary is a quadratic, rather than linear, function of the predictors.
set.seed(333)
juice.qda <- qda(growth~.,data=juice.train)
juice.qda
## Call:
## qda(growth ~ ., data = juice.train)
##
## Prior probabilities of groups:
## 0 1
## 0.6590909 0.3409091
##
## Group means:
## pH nisin temp brix
## 0 4.431034 42.75862 37.31034 15.48276
## 1 4.766667 14.66667 42.20000 12.46667
juice.qda.pred <- predict(juice.qda,newdata=juice.test)
names(juice.qda.pred)
## [1] "class" "posterior"
juice.qda.class <- juice.qda.pred$class
table3 <- table(juice.qda.class,juice.test$growth)
table3
##
## juice.qda.class 0 1
## 0 18 6
## 1 1 5
(table3[1,1]+table3[2,2])/30
## [1] 0.7666667
Finally, we can use K-Nearest Neighbors to make predictions, via the knn() function from the class package. knn() expects the training and test predictors as matrices, so the response variable, growth, is kept out of both and stored in its own training and test objects. The predictor matrix is also standardized with scale() before splitting, so that variables on larger scales (such as nisin) do not dominate the distance calculations. A seed is set because knn() breaks ties between equidistant neighbors at random. A model is then fit with k = 2 and its predictions are evaluated in the same manner as the previous examples.
library(class)
set.seed(444)
attach(juice)
juice2 <- cbind(pH,nisin,temp,brix)
juice2 <- scale(juice2)
train.x <- juice2[train,]
train.y <- growth[train]
test.x <- juice2[-train,]
test.y <- growth[-train]
juice.knn <- knn(train.x,test.x,train.y,k=2)
table4 <- table(juice.knn,test.y)
table4
## test.y
## juice.knn 0 1
## 0 15 1
## 1 4 10
(table4[1,1]+table4[2,2])/30
## [1] 0.8333333
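The choice of k = 2 here was arbitrary. A quick, informal way to see how sensitive the test accuracy is to k would be a small loop over candidate values (a sketch reusing the objects above; the range 1 to 10 is just illustrative):
set.seed(444)  # knn() breaks ties between equidistant neighbors at random
for (k in 1:10) {
  pred.k <- knn(train.x, test.x, train.y, k = k)
  cat('k =', k, 'test accuracy =', round(mean(pred.k == test.y), 3), '\n')
}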