Introduction

Synopsis: Suppose we know that 14 particular ferns have a given property - they contain a valuable antibiotic - and 22 different ferns are known not to have this property. We collect 4 measurements, X1-X4, on each fern. Question: Given another fern of unknown type, what is our best guess as to which of the two types the new fern belongs to, based on its values of X1-X4? That is, what is the best way to classify a new fern on the basis of these measurements?

Viewing the data

There are n = 36 ferns of known type: 14 contain an antibiotic and 22 do not. In addition, we have 4 new ferns of unknown type - that is, we do not know whether they contain the antibiotic. These are the ferns we would like to classify using our model.

setwd("/Users/cameliaguild/Rpubs")  # adjust to your own working directory
# Load the data
train.ferns <- read.csv("train.ferns.csv", header = TRUE)
unk.ferns <- read.csv("unknown.ferns.csv", header = TRUE)
#Ferns of known type
head(train.ferns, 3)
##   group measure1 measure2 measure3 measure4
## 1     1       57       66       18       34
## 2     1       45       95       60       75
## 3     1       36       91       68       66
tail(train.ferns, 3)
##    group measure1 measure2 measure3 measure4
## 34     2       36       88       36       90
## 35     2       35       73       47       87
## 36     2       44       56       41       57
# Ferns of unknown type
head(unk.ferns)
##   group measure1 measure2 measure3 measure4
## 1    NA       34       99       15       50
## 2    NA       45       70       49       86
## 3    NA       50       82       51      104
## 4    NA       46      119       43       79
# Add a new column, type, with binary coding for glm model
train.ferns$type[train.ferns$group == 1] <- 1
train.ferns$type[train.ferns$group == 2] <- 0
train.ferns$type <- factor(train.ferns$type)
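
As a quick sanity check (not in the original write-up), cross-tabulating group against the new type column confirms the recoding:

# Verify the recoding: group 1 maps to type 1 (antibiotic), group 2 to type 0
table(train.ferns$group, train.ferns$type)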

Exploratory Data Analysis

The antibiotic group tends to have higher measure2 values and lower measure4 values, while the non-antibiotic group shows the opposite pattern: lower measure2 values and higher measure4 values.

group <- factor(train.ferns$group)
plot(train.ferns$measure4, train.ferns$measure2, col= train.ferns$group, pch=18, main = "Plot of Observed Ferns of Known Type by Measure2 and Measure4", xlab="Measure4", ylab="Measure2")
text(train.ferns$measure4, train.ferns$measure2, labels=train.ferns$group, font=2, pos=4)
legend("topleft", legend = c("antibiotic", "non-antibiotic"), col = 1:nlevels(group), pch=18, title = "Type")

A summary of group means for each measurement confirms this: the antibiotic group has a higher mean on measure2 and a lower mean on measure4, while the non-antibiotic group has a lower mean on measure2 and a higher mean on measure4.

# Look at the mean for each measure by group
library(tidyverse)
train.ferns %>% group_by(group) %>% summarise(meanMeasure1=mean(measure1), meanMeasure2=mean(measure2), meanMeasure3=mean(measure3), meanMeasure4=mean(measure4))
## # A tibble: 2 × 5
##   group meanMeasure1 meanMeasure2 meanMeasure3 meanMeasure4
##   <int>        <dbl>        <dbl>        <dbl>        <dbl>
## 1     1         48.4         86.8         41.5         66.9
## 2     2         42.0         68.8         44.3         83.2
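
The same table can be computed more compactly with dplyr's across() (assuming dplyr >= 1.0):

# Equivalent group means using across()
train.ferns %>%
  group_by(group) %>%
  summarise(across(starts_with("measure"), mean))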

Look at the Correlations

Since no variable pair is highly correlated (|rho| > 0.8; the largest here is about 0.34), multicollinearity should not be a serious concern for our model.

# Look at correlations
measures <- dplyr::select(train.ferns, starts_with("m"))  # avoid masking base::subset
cor(measures)
##            measure1   measure2    measure3   measure4
## measure1  1.0000000 0.33090680 -0.24558296 -0.3424607
## measure2  0.3309068 1.00000000  0.03877515  0.0146210
## measure3 -0.2455830 0.03877515  1.00000000  0.2730523
## measure4 -0.3424607 0.01462100  0.27305231  1.0000000
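
To apply the |rho| > 0.8 rule programmatically rather than by eye, one can flag any off-diagonal correlation above the threshold (a minimal sketch):

# Any off-diagonal correlation exceeding the 0.8 threshold?
r <- cor(measures)
any(abs(r[upper.tri(r)]) > 0.8)  # FALSE here: the largest |rho| is about 0.34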

Testing For A Mean Difference

Independent-sample t-tests on the group mean differences between antibiotic and non-antibiotic ferns are statistically significant for measure2 and measure4 only (output for measure1 and measure3 not shown).

# Two-sample (independent sample) t-test
#t.test(measure1 ~ group, data=train.ferns, var.equal=TRUE)
t.test(measure2 ~ group, data=train.ferns, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  measure2 by group
## t = 3.892, df = 34, p-value = 0.0004409
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   8.585536 27.349529
## sample estimates:
## mean in group 1 mean in group 2 
##        86.78571        68.81818
#t.test(measure3 ~ group, data=train.ferns, var.equal=TRUE)
t.test(measure4 ~ group, data=train.ferns, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  measure4 by group
## t = -3.1385, df = 34, p-value = 0.0035
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -26.852362  -5.745041
## sample estimates:
## mean in group 1 mean in group 2 
##        66.92857        83.22727
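
Since the output for measure1 and measure3 is suppressed above, a compact way to see all four p-values at once is to loop over the measures (a sketch, not part of the original output):

# Two-sample t-test p-values for all four measures
sapply(paste0("measure", 1:4), function(m)
  t.test(reformulate("group", m), data = train.ferns, var.equal = TRUE)$p.value)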

Multiple Logistic Regression

To avoid over-fitting the model given the small sample size, model selection will proceed as follows: Step (1). Fit a model with all 4 possible explanatory variables. Step (2). Retain only those variables that are statistically significant at p < 0.05. All further modeling will include only the statistically significant variables from Step (1).

Model 1 below includes all four variables; however, only measure2 and measure4 are statistically significant at p < 0.05.

# Fit a glm (logistic) regression model. Only measure2 and measure4 are significant.
glm.fit1 <- glm(type ~ measure1 + measure2 + measure3 + measure4, data=train.ferns, family=binomial(link=logit))
summary(glm.fit1)
## 
## Call:
## glm(formula = type ~ measure1 + measure2 + measure3 + measure4, 
##     family = binomial(link = logit), data = train.ferns)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.45333  -0.29416  -0.07306   0.25188   2.72210  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  5.19461    6.37343   0.815  0.41505   
## measure1    -0.08433    0.06392  -1.319  0.18708   
## measure2     0.19958    0.07479   2.668  0.00762 **
## measure3    -0.06796    0.07919  -0.858  0.39079   
## measure4    -0.19440    0.08346  -2.329  0.01984 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.114  on 35  degrees of freedom
## Residual deviance: 18.806  on 31  degrees of freedom
## AIC: 28.806
## 
## Number of Fisher Scoring iterations: 7

Model 2 below includes only measure2 and measure4, since these were the only statistically significant variables in Model 1. This model will therefore be used to classify the ferns of unknown type.

# Re-fit model with only measure2 and measure4
glm.fit2 <- glm(type ~ measure2 + measure4, data=train.ferns, family=binomial(link=logit))
summary(glm.fit2)
## 
## Call:
## glm(formula = type ~ measure2 + measure4, family = binomial(link = logit), 
##     data = train.ferns)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9029  -0.4124  -0.1070   0.3624   2.2405  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.30166    3.75516  -0.347  0.72887   
## measure2     0.14021    0.04645   3.019  0.00254 **
## measure4    -0.13580    0.05163  -2.630  0.00853 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.114  on 35  degrees of freedom
## Residual deviance: 21.273  on 33  degrees of freedom
## AIC: 27.273
## 
## Number of Fisher Scoring iterations: 6
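
Two quick follow-ups support this choice (added here as a sketch). A likelihood-ratio test compares the reduced model against the full model; from the deviances shown above (21.273 - 18.806 = 2.47 on 2 df, p ≈ 0.29), dropping measure1 and measure3 does not significantly worsen the fit. Exponentiating the coefficients also gives odds ratios for interpretation:

# Likelihood-ratio test: Model 2 (reduced) vs. Model 1 (full)
anova(glm.fit2, glm.fit1, test = "Chisq")

# Odds ratios: multiplicative change in the odds of antibiotic type
# per one-unit increase in each measure
exp(coef(glm.fit2))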

Logistic Regression Model Evaluation

The model error rate, accuracy, sensitivity, and specificity are calculated below. The training error rate is 14%, so the accuracy is 86%. Model sensitivity and specificity are 78.6% and 90.9%, respectively. (Note that an error rate computed on the training data tends to understate the error on new data.)

# Calculate the training data classification error rate.
model_glm_pred <- ifelse(predict(glm.fit2, type = "response") > 0.5, "1", "0")
calc_class_err <- function(actual, predicted){
    mean(actual != predicted)
}
calc_class_err(actual = train.ferns$type, predicted = model_glm_pred)
## [1] 0.1388889
# Model Accuracy
#1-calc_class_err(actual = train.ferns$type, predicted = model_glm_pred)
train_df <- table(predicted = model_glm_pred, actual = train.ferns$type)
library(caret)
train_conMat <- confusionMatrix(train_df, positive = "1")
c(train_conMat$overall["Accuracy"],
  train_conMat$byClass["Sensitivity"],
  train_conMat$byClass["Specificity"])
##    Accuracy Sensitivity Specificity 
##   0.8611111   0.7857143   0.9090909
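
Because these figures come from the same data used to fit the model, they are likely optimistic. A leave-one-out cross-validated error rate (a sketch using boot::cv.glm; not part of the original analysis) gives a less biased estimate:

# Leave-one-out CV error with a 0.5 classification cutoff
library(boot)
cost <- function(actual, pred_prob) mean(actual != (pred_prob > 0.5))
cv.glm(train.ferns, glm.fit2, cost = cost)$delta[1]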

Using the Logistic Regression Model for Classification

First, let’s take a look at where the unknown ferns (coded as type = 3) fall on the plot. Two ferns have higher measure2 values and lower measure4 values, and two have higher measure4 values and lower measure2 values. Thus, it appears that two are antibiotic and two are non-antibiotic.

Plot of Measured Data Values for Ferns of Unknown Type

# Add a new column, type, coded as 3
unk.ferns$type  <- 3
unk.ferns$type <- factor(unk.ferns$type)
train.ferns$type <- ifelse(test=train.ferns$type == "0", "2", "1")
df <- rbind(train.ferns, unk.ferns)
plot(df$measure4, df$measure2, col = df$type, pch = 18, main = "Plot of Ferns of Unknown Type (= 3)", xlab = "Measure4", ylab = "Measure2")
text(df$measure4, df$measure2, labels = df$type, font = 2, pos = 4)
# Include the unknown ferns (type 3) in the legend
legend("topleft", legend = c("antibiotic", "non-antibiotic", "unknown"), col = 1:3, pch = 18, title = "Type")

Predict Ferns of Unknown Type

Next, apply the prediction model to the new data containing ferns of unknown type. With a 0.5 cutoff on the predicted probability, the model classifies ferns 1 and 4 as antibiotic (type 1) and ferns 2 and 3 as non-antibiotic (type 2).

# predict model on new data of (unknown fern type)
pred_glm <- predict(glm.fit2, newdata = unk.ferns, type="response")
pred_glm
##          1          2          3          4 
## 0.99694775 0.04047104 0.01930790 0.99057392
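
Applying the same 0.5 cutoff used in the model evaluation converts these probabilities into class labels (a minimal sketch; the labels are for readability only):

# Probability > 0.5 -> antibiotic; otherwise non-antibiotic
ifelse(pred_glm > 0.5, "antibiotic", "non-antibiotic")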

Note: The result of the prediction model is not surprising given what we observed in the plot. In this simple analysis, either the plot or the regression model could be used to classify the unknown ferns.