Synopsis: Suppose we know that n = 14 particular ferns have a given property - they contain a valuable antibiotic - and n = 22 different ferns are known not to have this property. We collect 4 measurements X1-X4 on each of the ferns. Question: Given another fern of unknown type, what is our best guess as to which of the two types of ferns the new fern belongs to, based on the values X1-X4 for the new fern? That is, what is the best way to classify a new fern on the basis of these measurements.
There are n = 36 ferns of of known type: n = 14 contain an antibiotic and n = 22 are non-antibiotic. In addition, we have 4 new ferns of unknown type - that is, we do not know if they contain the antibiotic or not. Thus, these are the ferns that we would like to classify using our model.
setwd("/Users/cameliaguild/Rpubs")
# Load the data
train.ferns <- read.csv("train.ferns.csv", header = TRUE)
unk.ferns <- read.csv("unknown.ferns.csv", header = TRUE)
#Ferns of known type
head(train.ferns, 3)## group measure1 measure2 measure3 measure4
## 1 1 57 66 18 34
## 2 1 45 95 60 75
## 3 1 36 91 68 66
tail(train.ferns, 3)## group measure1 measure2 measure3 measure4
## 34 2 36 88 36 90
## 35 2 35 73 47 87
## 36 2 44 56 41 57
# Ferns of unknown type
head(unk.ferns)## group measure1 measure2 measure3 measure4
## 1 NA 34 99 15 50
## 2 NA 45 70 49 86
## 3 NA 50 82 51 104
## 4 NA 46 119 43 79
# Add a new column, type, with binary coding for glm model
train.ferns$type[train.ferns$group == 1] <- 1
train.ferns$type[train.ferns$group == 2] <- 0
train.ferns$type <- factor(train.ferns$type)The antibiotic group seems to have high measure2 values and lower measure4 values versus lower measure2 values and higher measure4 values for the non-antibiotic group.
group <- factor(train.ferns$group)
plot(train.ferns$measure4, train.ferns$measure2, col= train.ferns$group, pch=18, main = "Plot of Observed Ferns of Known Type by Measure2 and Measure4", xlab="Measure4", ylab="Measure2")
text(train.ferns$measure4, train.ferns$measure2, labels=train.ferns$group, font=2, pos=4)
legend("topleft", legend = c("antibiotic", "non-antibiotic"), col = 1:nlevels(group), pch=18, title = "Type")A summary of group means for each measurement does confirm that the antibiotic group has a high mean value on measure2 and a lower mean value on measure4 versus a lower mean value on measure2 and a higher mean value on measure4 for the non-antibiotic group.
# Look at the mean for each measure by group
library(tidyverse)
train.ferns %>% group_by(group) %>% summarise(meanMeasure1=mean(measure1), meanMeasure2=mean(measure2), meanMeasure3=mean(measure3), meanMeasure4=mean(measure4))## # A tibble: 2 × 5
## group meanMeasure1 meanMeasure2 meanMeasure3 meanMeasure4
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 48.4 86.8 41.5 66.9
## 2 2 42.0 68.8 44.3 83.2
Since there is no variable pair that is highly correlated (rho > 0.8), our model will not suffer from the effects of multicollinearity.
# Look at correlations
subset <- dplyr::select(train.ferns, starts_with("m"))
cor(subset)## measure1 measure2 measure3 measure4
## measure1 1.0000000 0.33090680 -0.24558296 -0.3424607
## measure2 0.3309068 1.00000000 0.03877515 0.0146210
## measure3 -0.2455830 0.03877515 1.00000000 0.2730523
## measure4 -0.3424607 0.01462100 0.27305231 1.0000000
An independent sample T-test on group mean differences between antibiotic and non-antibiotic are statistically significant on measure2 and measure4 only (output for measure1 and measure3 are not shown).
# Two-sample (independent sample) t-test
#t.test(measure1 ~ group, data=train.ferns, var.equal=TRUE)
t.test(measure2 ~ group, data=train.ferns, var.equal=TRUE)##
## Two Sample t-test
##
## data: measure2 by group
## t = 3.892, df = 34, p-value = 0.0004409
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 8.585536 27.349529
## sample estimates:
## mean in group 1 mean in group 2
## 86.78571 68.81818
#t.test(measure3 ~ group, data=train.ferns, var.equal=TRUE)
t.test(measure4 ~ group, data=train.ferns, var.equal=TRUE)##
## Two Sample t-test
##
## data: measure4 by group
## t = -3.1385, df = 34, p-value = 0.0035
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -26.852362 -5.745041
## sample estimates:
## mean in group 1 mean in group 2
## 66.92857 83.22727
In order to not over-fit our model due to a small sample size, model selection will proceed as follows: Step (1). Fit a model with all 4 possible explanatory variables. Step (2). Retain only those variables that are statistically significant if the p-value < 0.05. Thus, all further modeling will only include statistically significant variables from step (1).
Model 1 below specifies all four variables, however only measure2 and meeasure4 are statistically significant at a p-value < 0.05.
# Fit a glm (logistic) regression model. Only measure2 and measure4 are significant.
glm.fit1 <- glm(type ~ measure1 + measure2 + measure3 + measure4, data=train.ferns, family=binomial(link=logit))
summary(glm.fit1)##
## Call:
## glm(formula = type ~ measure1 + measure2 + measure3 + measure4,
## family = binomial(link = logit), data = train.ferns)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.45333 -0.29416 -0.07306 0.25188 2.72210
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.19461 6.37343 0.815 0.41505
## measure1 -0.08433 0.06392 -1.319 0.18708
## measure2 0.19958 0.07479 2.668 0.00762 **
## measure3 -0.06796 0.07919 -0.858 0.39079
## measure4 -0.19440 0.08346 -2.329 0.01984 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 48.114 on 35 degrees of freedom
## Residual deviance: 18.806 on 31 degrees of freedom
## AIC: 28.806
##
## Number of Fisher Scoring iterations: 7
Model 2 below, is specified to only include measure2 and measure4 since these were the only statistically significant variables from Model 1. Therefore, this model will be used to classify the data containing ferns of unknown type.
# Re-fit model with only measure2 and measure4
glm.fit2 <- glm(type ~ measure2 + measure4, data=train.ferns, family=binomial(link=logit))
summary(glm.fit2)##
## Call:
## glm(formula = type ~ measure2 + measure4, family = binomial(link = logit),
## data = train.ferns)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9029 -0.4124 -0.1070 0.3624 2.2405
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.30166 3.75516 -0.347 0.72887
## measure2 0.14021 0.04645 3.019 0.00254 **
## measure4 -0.13580 0.05163 -2.630 0.00853 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 48.114 on 35 degrees of freedom
## Residual deviance: 21.273 on 33 degrees of freedom
## AIC: 27.273
##
## Number of Fisher Scoring iterations: 6
The model error rate, accuracy, sensitivity and specificity are calculated below. The training model error rate is 14% and thus the accuracy is 86%. Model sensitivity and specificity are 78.6% and 90.9%, respectively.
# Calculate the training data classification error rate.
model_glm_pred = ifelse(predict(glm.fit2, type="response") > 0.5, "1", "0")
calc_class_err = function(actual, predicted){
mean(actual !=predicted)
}
calc_class_err(actual = train.ferns$type, predicted = model_glm_pred)## [1] 0.1388889
# Model Accuracy
#1-calc_class_err(actual = train.ferns$type, predicted = model_glm_pred)
train_df = table(predicted = model_glm_pred, actual = train.ferns$type)
library(caret)
train_conMat = confusionMatrix(train_df, positive = "1")
c(train_conMat$overall["Accuracy"],
train_conMat$byClass["Sensitivity"],
train_conMat$byClass["Specificity"])## Accuracy Sensitivity Specificity
## 0.8611111 0.7857143 0.9090909
First, let’s take a look at where the unknown ferns (coded as type =3) belong on the plot. There are two ferns with higher measure2 values and lower measure4 values; and two ferns with higher measure4 values and lower measure2 values. Thus, it seems that two are antibiotic and 2 are non-antibiotic.
# Add a new column, type, coded as 3
unk.ferns$type <- 3
unk.ferns$type <- factor(unk.ferns$type)
train.ferns$type <- ifelse(test=train.ferns$type == "0", "2", "1")
df <- rbind(train.ferns, unk.ferns)
group <- factor(df$group)
plot(df$measure4, df$measure2, col= df$type, pch=18, main = "Plot of Ferns of Unknown Type (= 3)", xlab="Measure4", ylab="Measure2")
text(df$measure4, df$measure2, labels=df$type, font=2, pos=4)
legend("topleft", legend = c("antibiotic", "non-antibiotic"), col = 1:nlevels(group), pch=18, title = "Type")Next, apply the prediction model to the new data containing ferns of unknown type. The model classifies ferns 1 and 4 as antibiotic (type 1) and ferns 2 and 3 as non-antibiotic (type 2).
# predict model on new data of (unknown fern type)
pred_glm <- predict(glm.fit2, newdata = unk.ferns, type="response")
pred_glm## 1 2 3 4
## 0.99694775 0.04047104 0.01930790 0.99057392
Note: The result of the prediction model is not surprising given what we observed in the plot. In this simple analysis, either the plot or the regression model could be used to classify the unknown ferns.