The penguin dataset contains 344 observations of 8 variables, 5 of which are numeric and 3 of which are qualitative. The dataset is mostly complete; only a few observations have missing values, and we remove those records before modeling.
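A minimal sketch of this cleaning step (the palmerpenguins package is assumed, and the year column appears to have been dropped before summarizing):
library(palmerpenguins)
library(dplyr)
penguins_df <- penguins %>%
  dplyr::select(-year) %>%  # keep species, island, sex and the four measurements
  na.omit()                 # remove the few records with missing values
summary(penguins_df)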
## species island bill_length_mm bill_depth_mm
## Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60
## Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30
## Mean :43.99 Mean :17.16
## 3rd Qu.:48.60 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex
## Min. :172 Min. :2700 female:165
## 1st Qu.:190 1st Qu.:3550 male :168
## Median :197 Median :4050
## Mean :201 Mean :4207
## 3rd Qu.:213 3rd Qu.:4775
## Max. :231 Max. :6300
The target variable of interest is the penguin species, which falls into three classes: Adelie, Chinstrap, and Gentoo.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
We have two categorical variables, island and sex, and four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
The island feature shows some interesting patterns: penguins on Torgersen island are always Adelie, Dream island has only Adelie and Chinstrap penguins, and Biscoe island has only Adelie and Gentoo penguins. So island is informative about species. However, we drop the categorical variables island and sex, since discriminant analysis requires continuous predictors.
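One quick way to check this breakdown is a cross-tabulation of island against species:
table(penguins_df$island, penguins_df$species)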
Now let's examine the continuous features by plotting them against each other.
Looking at the feature plot, bill_length plays an important role in distinguishing the species; bill_depth, flipper_length, and body_mass also contribute to the separation.
featurePlot(x=penguins_df[,2:5], y=penguins_df$species, plot="box",
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),
            layout = c(4, 1),
            auto.key = list(columns = 3))
From the boxplots above it looks like flipper_length and body_mass are probably correlated.
featurePlot(x=penguins_df[,2:5], y=penguins_df$species, plot="density",
            ## Pass in options to xyplot() to
            ## make it prettier
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1.5,
            pch = "|",
            layout = c(4, 1),
            auto.key = list(columns = 3))
From the density plots, the features appear approximately normally distributed within each species. This is an important assumption of LDA and QDA.
Let’s find the correlation between the features.
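The matrix below can be computed directly from the four numeric columns, for example:
cor(penguins_df[, c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")])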
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.5894511
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.4720157
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.8729789
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.0000000
As observed above, flipper_length and body_mass are highly correlated (r ≈ 0.87). We keep body_mass, which is less correlated with the remaining predictors than flipper_length is, and drop flipper_length from the model.
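A sketch of that selection (penguins_final is the data frame used for the split below; dplyr assumed):
penguins_final <- penguins_df %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm, body_mass_g)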
The data is split into training (80%) and testing (20%) sets; the test set contains 65 observations.
set.seed(622)
trainIndex <- createDataPartition(penguins_final$species, p = .8, list = FALSE, times = 1)
training <- penguins_final[ trainIndex,]
testing <- penguins_final[-trainIndex,]
dim(testing)
## [1] 65 4
LDA assumes that the predictors are multivariate Gaussian with a common covariance matrix across classes. From the distribution plots above, LDA seems to be a reasonable fit.
By the proportion of trace, the first linear discriminant LD1 achieves 83.74% of the separation and the second, LD2, 16.26%.
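The fit summarized below is produced by MASS::lda; a sketch of the call, using the object name model_lda that the prediction code further down assumes:
library(MASS)
model_lda <- lda(species ~ ., data = training)
model_lda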
## Call:
## lda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.81880 18.36752 3707.692
## Chinstrap 48.73091 18.51091 3738.182
## Gentoo 47.56562 15.01979 5088.281
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.141524248 -0.408190182
## bill_depth_mm -1.044885950 -0.034633623
## body_mass_g 0.001989374 0.001903976
##
## Proportion of trace:
## LD1 LD2
## 0.8374 0.1626
The misclassification error rates on the training and test sets, respectively:
## [1] 0.007462687
## [1] 0.01538462
#table(predicted = model_lda.pred$class, actual = testing$species)
predictions_test <- predict(model_lda, testing)  # LDA class predictions on the test set
cm.lda <- confusionMatrix(predictions_test$class, testing$species)
cm.lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
From the confusion matrix we see that the LDA model reaches 98.46% accuracy on the test set; only one Adelie penguin is misclassified (as Chinstrap).
The partition plot below helps to visualize LDA. The observations that end up misclassified are shown in red. In the case of LDA, the decision boundary is linear.
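One way to draw such partition plots is a sketch like the following, assuming the klaR package:
library(klaR)
partimat(species ~ ., data = training, method = "lda")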
Unlike LDA, QDA assumes that each class has its own covariance matrix. As a result, the decision boundary is quadratic.
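The QDA fit summarized below uses MASS::qda in the same way, with the object name model_qda that the prediction code uses:
model_qda <- qda(species ~ ., data = training)
model_qda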
## Call:
## qda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.81880 18.36752 3707.692
## Chinstrap 48.73091 18.51091 3738.182
## Gentoo 47.56562 15.01979 5088.281
qda_predictions <- predict(model_qda, testing)
cm.qda <- confusionMatrix(qda_predictions$class, testing$species)
#qda.acc <- cm.qda[[3]][1]
#scores <- rbind(scores, data.frame(model="QDA", accuracy=qda.acc))
cm.qda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
The QDA model achieves the same 98.46% test accuracy as the LDA model. It seems that even with QDA, most decision boundaries remain nearly linear.
The Naive Bayes classifier assumes that the features are conditionally independent given the class (and, in this simple form, treats each feature as equally important), which is often not the case and may introduce some bias. However, the independence assumption simplifies computation by turning the joint conditional probability into a product of per-feature probabilities. We do not need the exact posterior probability, only which class is most likely.
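Concretely, for a class k and predictors x1, x2, x3, the classifier scores each class in proportion to P(k) * P(x1 | k) * P(x2 | k) * P(x3 | k) and picks the class with the largest score, so only the per-feature conditional densities need to be estimated.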
features <- setdiff(names(training), "species")
x <- training[,features]
y <- training$species
model.naive <- naiveBayes(x = x,y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
cm.naive
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 27 0 0
## Chinstrap 2 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9522
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9310 1.0000 1.0000
## Specificity 1.0000 0.9615 1.0000
## Pos Pred Value 1.0000 0.8667 1.0000
## Neg Pred Value 0.9474 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4154 0.2000 0.3538
## Detection Prevalence 0.4154 0.2308 0.3538
## Balanced Accuracy 0.9655 0.9808 1.0000
The Naive Bayes classifier yields 96.9% accuracy on the test set, with two Adelie penguins misclassified as Chinstrap.
The LDA and QDA models are nearly identical and achieved the same accuracy. The Naive Bayes model tends to have lower accuracy and a longer lower tail than LDA.
In general, LDA models are more stable than logistic regression when the classes are well separated and the features are approximately normal, which is the case here. However, by assuming that the predictors are multivariate Gaussian with a common covariance matrix, we limit the flexibility of the fit. Comparing the partition plots for LDA and QDA, most QDA boundaries were nearly linear, showing that the LDA assumptions nearly hold. The correlation between bill depth and flipper length, as well as between body mass and flipper length, was an early indication that QDA might be a better fit. However, QDA provided no meaningful gain in performance over LDA, so we prefer the more parsimonious LDA model, which has fewer covariance parameters to estimate.
While the Naive Bayes model is easy to implement it likely suffers in our case from the assumption of feature independence. As stated above, some of the predictor variables were correlated, thereby violating the assumption and introducing bias.
Comments about our LDA model:
Prior probabilities of groups: the proportion of training observations in each group. For example, 43.7% of the training observations are in the Adelie group.
Group means: the group center of gravity, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of the predictor variables used to form the LDA decision rule. For example,
LD1 = 0.1415 * bill_length_mm - 1.0449 * bill_depth_mm + 0.00199 * body_mass_g
Similarly, LD2 = -0.4082 * bill_length_mm - 0.0346 * bill_depth_mm + 0.0019 * body_mass_g
The “proportion of trace” is the percentage of between-group separation achieved by each discriminant function. In our case LD1 provides 83.74% and LD2 provides 16.26% of the separation.
Using the function plot() produces plots of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.
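For example, with the fitted LDA object named model_lda as above:
plot(model_lda)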