Welcome to the Penguin Species Classification project! This repository contains the code and resources for a machine learning model that classifies penguin species based on various features using a Multilayer Perceptron (MLP) neural network. The goal of this project is to demonstrate the application of deep learning techniques in identifying different species from collected data.
Multilayer Perceptron (MLP): The core of this project is a Multilayer Perceptron, a type of artificial neural network known for its ability to handle complex patterns in data. The MLP is trained on a dataset of penguin features to predict the species of a given penguin.
Penguin Dataset: We will use a dataset that includes measurements such as bill length, bill depth, flipper length, and body mass. The dataset also includes the island where the penguin was found and the year each sample was recorded. This dataset is essential for training and evaluating the performance of the MLP model.
Classification Accuracy: The model’s performance is evaluated based on classification accuracy, showcasing how well it generalizes to unseen penguin samples. The aim is to achieve high accuracy in predicting the correct species.
The data was obtained from the Palmer Penguins dataset for binary classification. It consists of 274 rows and 7 columns.
| Column | Description |
|---|---|
| island | Island where the penguin was found (Biscoe, Dream, or Torgersen) |
| bill_length_mm | Bill length in millimeters |
| bill_depth_mm | Bill depth in millimeters |
| flipper_length_mm | Flipper length in millimeters |
| body_mass_g | Body mass in grams |
| year | Year when the penguin was observed |
| species | Species of the penguin (Adelie or Gentoo) |
To get started with this project, follow these steps:
> library(readxl)
> library(neuralnet)
> library(dplyr)
> library(caret)
> penguin_data = read_excel("C:/Users/acer/Downloads/penguin_binary_classification.xlsx")
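After loading, it helps to glance at the data to confirm the columns and types match the table above. These checks are optional suggestions and are not part of the original workflow.
> str(penguin_data)   # column types and a preview of the values
> head(penguin_data)  # first few rows of the dataset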
Preprocessing Data
Since ‘island’ is categorical data, we will assign a unique numerical label to each island category.
> penguin_data$island <- as.numeric(factor(penguin_data$island))
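By default, factor() sorts the levels alphabetically, so Biscoe, Dream, and Torgersen are encoded as 1, 2, and 3 respectively. The optional check below (not in the original code) shows how many rows fall under each numeric code.
> table(penguin_data$island)  # counts of observations per island code (1, 2, 3)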
Data Transformation
In this project we will use the logistic activation function. This function, also known as the sigmoid activation, is commonly used in artificial neural networks, especially in binary classification problems. It maps any real-valued number to a value between 0 and 1.
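For reference, the logistic (sigmoid) function is
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
The scl helper below rescales each predictor column to the [0, 1] range using min-max normalization.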
> scl <- function(x){(x-min(x))/(max(x)-min(x))}
We only need to transform all columns except the species column.
> sclPenguin <- data.frame(lapply(penguin_data[,1:6],scl))
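As an optional sanity check (not part of the original code), confirm that every scaled column now lies between 0 and 1.
> summary(sclPenguin)  # the Min. of each column should be 0 and the Max. should be 1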
Dividing Training and Testing Data
We will divide the data in a ratio of 80:20 (80% for training and 20% for testing), sampling randomly and proportionally from each species category. Since we have 274 observations split evenly between the two species, each category has 137 observations, and 20% of 137 is about 27 observations, which we will use as testing data per species.
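Because sample() draws at random, the exact split changes from run to run. For a reproducible split, a seed can be set before sampling; the value below is arbitrary and was not part of the original run.
> set.seed(123)  # hypothetical seed; any fixed value makes the split reproducible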
> testAdelie <- sample(c(1:137), 27, replace = FALSE, prob = NULL)
> testGentoo <- sample(c(138:274), 27, replace = FALSE, prob = NULL)
Next, we will combine the testing indices.
> nTest <- c(testAdelie, testGentoo)
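Since each species contributes 27 indices, nTest should contain 54 entries. This check is optional and not in the original code.
> length(nTest)  # expected: 54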
After that, we will define the training data, which is obtained from sclPenguin without the nTest observations, combined with the species column from penguin_data.
> trainPenguin <- cbind(sclPenguin[-nTest,], penguin_data[-nTest, 7])
We will assign names to each column of trainPenguin.
> names(trainPenguin) <- c("island", "bill_length_mm", "bill_depth_mm","flipper_length_mm", "body_mass_g", "year", "species")
Next, we’ll also define the testing data and rename its columns.
> testPenguin <- cbind(sclPenguin[c(nTest),], penguin_data[c(nTest), 7])
> names(testPenguin) <- c("island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year", "species")
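As a quick check (not in the original transcript), the two data frames should have 220 and 54 rows respectively, each with 7 columns.
> dim(trainPenguin)  # expected: 220 rows, 7 columns
> dim(testPenguin)   # expected: 54 rows, 7 columns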
In this project, we’ll use 1 hidden layer. To determine how many neurons the layer should have, we will use a formula from Masters (1993) that follows the geometric pyramid rule:
\[ n_{z} = \sqrt{n_{x}n_{y}} \]
where \(n_{x}\) is the number of predictor variables. We have 6 of them (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year). And \(n_{y}\) is the number of categories in the dependent variable. We have 2 species categories (Adelie or Gentoo).
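Plugging these values into the formula gives
\[ n_{z} = \sqrt{6 \times 2} = \sqrt{12} \approx 3.46 \]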
Hence, we round this to 3 neurons for the hidden layer.
> NNPenguin <- neuralnet(species~island+bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g+year, data = trainPenguin, hidden = 3, act.fct = "logistic", linear.output = FALSE)
>
> plot(NNPenguin)
>
> weights(NNPenguin)
[[1]]
[[1]][[1]]
[,1] [,2] [,3]
[1,] -0.58594952 -0.3787971 0.36055868
[2,] 0.84716765 -55.1852304 -49.51500175
[3,] 0.04711323 1.9290804 0.20921764
[4,] -1.49603189 -0.7622653 0.21538918
[5,] -0.37645798 1.2188321 0.05572218
[6,] 0.41204889 0.7719875 0.78172280
[7,] 0.92711394 -1.8200334 -1.52052992
[[1]][[2]]
[,1] [,2]
[1,] 2.565182 -1.767841
[2,] 2.499890 -3.099904
[3,] -4.030597 4.032266
[4,] -3.824393 3.626409
> ptestPenguin <- predict(NNPenguin, testPenguin[,-7])
> head(ptestPenguin)
[,1] [,2]
56 0.88762018 0.16912983
21 0.95905341 0.07602764
128 0.97927838 0.03330979
109 0.05287780 0.96631031
42 0.03922971 0.97500044
29 0.96892202 0.05456578
>
> prediksiPenguin <- data.frame(max.col(ptestPenguin))
> prediksi <- recode(prediksiPenguin$max.col.ptestPenguin., '1'="Adelie", '2'="Gentoo")
> head(prediksi)
[1] "Adelie" "Adelie" "Adelie" "Gentoo" "Gentoo" "Adelie"
> prediksi <- factor(prediksi, levels = levels(factor(testPenguin[, 7])))
> confusionMatrix(factor(prediksi), factor(testPenguin[,7]))
Confusion Matrix and Statistics
Reference
Prediction Adelie Gentoo
Adelie 30 2
Gentoo 3 19
Accuracy : 0.9074
95% CI : (0.797, 0.9692)
No Information Rate : 0.6111
P-Value [Acc > NIR] : 1.097e-06
Kappa : 0.8069
Mcnemar's Test P-Value : 1
Sensitivity : 0.9091
Specificity : 0.9048
Pos Pred Value : 0.9375
Neg Pred Value : 0.8636
Prevalence : 0.6111
Detection Rate : 0.5556
Detection Prevalence : 0.5926
Balanced Accuracy : 0.9069
'Positive' Class : Adelie
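The reported accuracy can also be verified directly from the predictions; this check is a suggestion and was not part of the original transcript.
> mean(as.character(prediksi) == testPenguin[, 7])  # proportion of test penguins classified correctly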