Overview

Welcome to the Penguin Species Classification project! This repository contains the code and resources for a machine learning model that classifies penguin species from various physical measurements using a Multilayer Perceptron (MLP) neural network. The goal of this project is to demonstrate the application of deep learning techniques in identifying different species from collected data.

Features

  • Multilayer Perceptron (MLP): The core of this project is a Multilayer Perceptron, a type of artificial neural network known for its ability to capture complex patterns in data. The MLP is trained on a dataset of penguin features to predict the species of a given penguin.

  • Penguin Dataset: We use a dataset that includes measurements such as bill length, bill depth, flipper length, and body mass. The dataset also includes the island where the penguin was found and the year each sample was recorded. This dataset is essential for training and evaluating the performance of the MLP model.

  • Classification Accuracy: The model’s performance is evaluated by classification accuracy, showing how well it generalizes to unseen penguin samples. The aim is to achieve high accuracy in predicting the correct species.

Data Overview

The data was obtained from the Palmer Penguins dataset and restricted to two species (Adelie and Gentoo) for binary classification. The data consists of 274 rows and 7 columns.

Column              Description
island              Island where the penguin was found (Biscoe, Dream, or Torgersen)
bill_length_mm      Bill length in millimeters
bill_depth_mm       Bill depth in millimeters
flipper_length_mm   Flipper length in millimeters
body_mass_g         Body mass in grams
year                Year the penguin was observed
species             Species of the penguin (Adelie or Gentoo)

Getting Started

To get started with this project, follow these steps:

  1. Libraries Used
> library(readxl)
> library(neuralnet)
> library(dplyr)
> library(caret)
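If any of these packages are not yet installed, they can be installed from CRAN first (a one-time setup):

> install.packages(c("readxl", "neuralnet", "dplyr", "caret"))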
  2. Import Data
> penguin_data = read_excel("C:/Users/acer/Downloads/penguin_binary_classification.xlsx")
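A quick check (optional) should confirm the expected 274 rows and 7 columns:

> dim(penguin_data)
[1] 274   7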
  3. Preprocessing Data

    Since ‘island’ is categorical data, we will assign a unique numerical label to each island category.

> penguin_data$island <- as.numeric(factor(penguin_data$island))
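Note that factor() orders its levels alphabetically by default, so (assuming all three islands appear in the file) Biscoe, Dream, and Torgersen are coded as 1, 2, and 3 respectively. The level order can be inspected before the conversion with:

> levels(factor(penguin_data$island))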
  4. Data Transformation

    In this project we will use the logistic activation function. Also known as the sigmoid activation, it is commonly used in artificial neural networks, especially for binary classification problems, because it maps any real-valued number to a value between 0 and 1.
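As a quick standalone illustration (not part of the model pipeline), the logistic function can be written and evaluated directly in R:

> logistic <- function(x){1/(1 + exp(-x))}
> logistic(c(-5, 0, 5))
[1] 0.006692851 0.500000000 0.993307149

The transformation we apply to the input data itself, however, is min-max scaling, defined below.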

> scl <- function(x){(x-min(x))/(max(x)-min(x))}

The scl function above performs min-max normalization, mapping each column onto the [0, 1] range to match the output range of the logistic function. We only need to transform the six predictor columns; the species column is left unchanged.

> sclPenguin <- data.frame(lapply(penguin_data[,1:6],scl))
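An optional sanity check confirms that every scaled column now lies between 0 and 1:

> apply(sclPenguin, 2, range)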
  5. Dividing Training and Testing Data

    We will divide the data in an 80:20 ratio (80% for training and 20% for testing), sampling randomly within each species category so that both species stay proportionally represented. Since we have 274 observations, each category has 137 observations, and 20% of 137 is about 27 observations, which will be used as testing data for each species.
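Because sample() draws randomly, the exact split will differ between runs; setting a seed beforehand makes the split reproducible (the value 123 is just an arbitrary example):

> set.seed(123)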

> testAdelie <- sample(c(1:137), 27, replace = FALSE, prob = NULL)
> testGentoo <- sample(c(138:274), 27, replace = FALSE, prob = NULL)

Next, we combine the two sets of testing indices.

> nTest <- c(testAdelie, testGentoo)

After that, we define the training data, which combines the sclPenguin rows not in nTest with the corresponding species column from penguin_data.

> trainPenguin <- cbind(sclPenguin[-nTest,], penguin_data[-nTest, 7])

We then name each column of trainPenguin.

> names(trainPenguin) <- c("island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year", "species")

Next, we also define the testing data and rename its columns in the same way.

> testPenguin <- cbind(sclPenguin[c(nTest),], penguin_data[c(nTest), 7])
> names(testPenguin) <- c("island", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "year", "species")
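As an optional check, and assuming the rows are ordered by species as described above, the split should contain 220 training rows and 54 testing rows, with 27 penguins of each species in the test set:

> dim(trainPenguin)
> dim(testPenguin)
> table(testPenguin$species)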
  6. Counting Hidden Layer Neurons

In this project, we’ll use one hidden layer. To determine how many neurons the layer should have, we use the geometric pyramid rule given by Masters (1993):

\[ n_{z} = \sqrt{n_{x}n_{y}} \]

where \(n_{x}\) is the number of predictor variables; we have 6 of them (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, and year). \(n_{y}\) is the number of categories of the dependent variable; we have 2 species categories (Adelie and Gentoo).

Hence, \(n_{z} = \sqrt{6 \times 2} \approx 3.46\), which we round to 3 neurons in the hidden layer.
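The arithmetic can be checked directly:

> sqrt(6 * 2)
[1] 3.464102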

  7. Algorithm
> NNPenguin = neuralnet(species~island+bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g+year, data = trainPenguin, hidden = 3, act.fct = "logistic", linear.output = FALSE)
> 
> plot(NNPenguin)
> 
> weights(NNPenguin)
[[1]]
[[1]][[1]]
            [,1]        [,2]         [,3]
[1,] -0.58594952  -0.3787971   0.36055868
[2,]  0.84716765 -55.1852304 -49.51500175
[3,]  0.04711323   1.9290804   0.20921764
[4,] -1.49603189  -0.7622653   0.21538918
[5,] -0.37645798   1.2188321   0.05572218
[6,]  0.41204889   0.7719875   0.78172280
[7,]  0.92711394  -1.8200334  -1.52052992

[[1]][[2]]
          [,1]      [,2]
[1,]  2.565182 -1.767841
[2,]  2.499890 -3.099904
[3,] -4.030597  4.032266
[4,] -3.824393  3.626409
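The same matrices are stored on the fitted object itself: the first (7 x 3) holds the weights from the bias term plus the six inputs into the three hidden neurons, and the second (4 x 2) the weights from the bias plus the three hidden neurons into the two output units:

> NNPenguin$weights[[1]][[1]]   # bias + 6 inputs -> 3 hidden neurons
> NNPenguin$weights[[1]][[2]]   # bias + 3 hidden neurons -> 2 outputs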
  8. Prediction for Testing Data
> ptestPenguin <- predict(NNPenguin, testPenguin[,-7])
> head(ptestPenguin)
          [,1]       [,2]
56  0.88762018 0.16912983
21  0.95905341 0.07602764
128 0.97927838 0.03330979
109 0.05287780 0.96631031
42  0.03922971 0.97500044
29  0.96892202 0.05456578
> 
> prediksiPenguin <- data.frame(max.col(ptestPenguin))
> prediksi <- recode(prediksiPenguin$max.col.ptestPenguin., '1'="Adelie", '2'="Gentoo")
> head(prediksi)
[1] "Adelie" "Adelie" "Adelie" "Gentoo" "Gentoo" "Adelie"
  9. Classification Accuracy
> prediksi <- factor(prediksi, levels = levels(factor(testPenguin[, 7])))
> confusionMatrix(factor(prediksi), factor(testPenguin[,7]))
Confusion Matrix and Statistics

          Reference
Prediction Adelie Gentoo
    Adelie     30      2
    Gentoo      3     19
                                         
               Accuracy : 0.9074         
                 95% CI : (0.797, 0.9692)
    No Information Rate : 0.6111         
    P-Value [Acc > NIR] : 1.097e-06      
                                         
                  Kappa : 0.8069         
                                         
 Mcnemar's Test P-Value : 1              
                                         
            Sensitivity : 0.9091         
            Specificity : 0.9048         
         Pos Pred Value : 0.9375         
         Neg Pred Value : 0.8636         
             Prevalence : 0.6111         
         Detection Rate : 0.5556         
   Detection Prevalence : 0.5926         
      Balanced Accuracy : 0.9069         
                                         
       'Positive' Class : Adelie
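The same accuracy can also be computed directly, without caret, as the share of test rows whose predicted label matches the true species:

> mean(prediksi == testPenguin$species)
[1] 0.9074074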