9285 train, 375 test, 375 validation images 224 X 224 X 3 jpg format
The dataset is taken from Kaggle https://www.kaggle.com/datasets/gpiosenka/butterfly-images40-species.
Description of Dataset:
The dataset contains butterfly images of 75 species. We will use a convolutional neural network (CNN) to classify the images and predict the butterfly species.
The dataset provides train, test, and validation sets for 75 butterfly species. All images are 224 X 224 X 3 in jpg format. The train set consists of 9285 images partitioned into 75 subdirectories, one for each species. The test set consists of 375 images partitioned into 75 subdirectories with 5 test images per species. The validation set consists of 375 images partitioned into 75 subdirectories with 5 validation images per species.
Note: Please extract the dataset archive.zip to the project folder (for example, "C:\CIS8392" is the project folder in my VM) and run the code below.
Loading the required packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.3 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(imager)
## Warning: package 'imager' was built under R version 4.1.3
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
##
## add
## The following object is masked from 'package:stringr':
##
## boundary
## The following object is masked from 'package:tidyr':
##
## fill
## The following objects are masked from 'package:stats':
##
## convolve, spectrum
## The following object is masked from 'package:graphics':
##
## frame
## The following object is masked from 'package:base':
##
## save.image
library(keras)
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Screenshot to show that no R code is available on Kaggle on this dataset
Image_RCode <- "https://www.linkpicture.com/q/ImageProof_With_No_RCode.png"
knitr::include_url(Image_RCode)
Sample images of butterfly species:
set.seed(123)
folder_list <- list.files("train/")
folder_path <- paste0("train/", folder_list, "/")
file_name <- map(folder_path, function(x) paste0(x, list.files(x))) %>% unlist()
sample_image <- sample(file_name, 6)
img <- map(sample_image, load.image)
par(mfrow = c(2, 3)) # Create 2 x 3 image grid
map(img, plot)
## [[1]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
##
## [[2]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
##
## [[3]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
##
## [[4]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
##
## [[5]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
##
## [[6]]
## Image. Width: 224 pix Height: 224 pix Depth: 1 Colour channels: 3
Getting the width, height, depth and channels of Image
img <- load.image(file_name[1])
dim(img)
## [1] 224 224 1 3
1. Feature Learning - The Feature Learning part mainly consists of Convolutional Layers and Pooling Layers.
2. Classification - The Classification part is responsible for assigning images to their respective categories based on the features (Feature Maps) that the Feature Learning part has extracted from the image. The Classification part usually consists of a Flatten Layer and a network of Fully Connected Hidden Layers.
The Convolution Layer is responsible for decomposing the Image into various feature maps using different kernels.
The next layer in our network is the Pooling Layer, which downsamples the feature maps created by the Convolutional layers by keeping a summary (for example, the maximum) of each small region.
filters — Number of different filters (feature detectors) that will be applied on the original image to create feature maps. Different types of filters are Edge Detection Filter, Blur Filter etc.
kernel_size — Dimension of the convolution filter (n x n) matrix.
activation — The activation function for the neurons. A general rule of thumb is to use the Rectified Linear Unit (ReLU) function as the activation function for every layer except the output layer. ReLU also adds non-linearity to the network, which the network needs in order to learn non-linear relationships in the feature maps.
Input Layer — Takes the shape of the Input Images and number of channels (3 for color and 1 for B/W image).
pool_size — Dimension of pooling matrix (m x m)
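To make the roles of kernel_size and pool_size concrete, here is a minimal base R sketch of one convolution followed by ReLU and max pooling on a tiny single-channel image. The helper names (`conv2d_valid`, `relu`, `max_pool`) and the hand-written edge kernel are assumptions for illustration only; in a real CNN the kernel values are learned and keras does this work internally.

```r
# Stride-1 convolution without padding (what CNN layers compute is
# technically a cross-correlation: the kernel is not flipped).
conv2d_valid <- function(image, kernel) {
  out_rows <- nrow(image) - nrow(kernel) + 1
  out_cols <- ncol(image) - ncol(kernel) + 1
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      patch <- image[i:(i + nrow(kernel) - 1), j:(j + ncol(kernel) - 1)]
      out[i, j] <- sum(patch * kernel)
    }
  }
  out
}

# ReLU keeps positive values and replaces negative values with 0
relu <- function(x) pmax(x, 0)

# Non-overlapping max pooling with a pool_size x pool_size window
max_pool <- function(feature_map, pool_size = 2) {
  out_rows <- nrow(feature_map) %/% pool_size
  out_cols <- ncol(feature_map) %/% pool_size
  out <- matrix(0, out_rows, out_cols)
  for (i in seq_len(out_rows)) {
    for (j in seq_len(out_cols)) {
      rows <- ((i - 1) * pool_size + 1):(i * pool_size)
      cols <- ((j - 1) * pool_size + 1):(j * pool_size)
      out[i, j] <- max(feature_map[rows, cols])
    }
  }
  out
}

# Toy 6 x 6 image: dark left half (0), bright right half (1)
image <- cbind(matrix(0, 6, 3), matrix(1, 6, 3))

# Hand-written 3 x 3 vertical edge kernel (illustrative, not learned)
kernel <- matrix(c(-1, 0, 1,
                   -1, 0, 1,
                   -1, 0, 1), nrow = 3, byrow = TRUE)

feature_map <- relu(conv2d_valid(image, kernel))  # 4 x 4, peaks at the edge
pooled <- max_pool(feature_map, pool_size = 2)    # 2 x 2 summary
```

A 3 x 3 kernel shrinks the 6 x 6 image to a 4 x 4 feature map, and a pool_size of 2 halves each dimension again.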
The Flatten layer is used to convert the 2D output array from Pooling Layer or Convolutional layer to 1D array (Flattening the input) before feeding it to the fully connected layers.
The fully connected layers are a network of serially connected dense layers that would be used for classification.
The final layer in our CNN is the output layer. The number of Neurons in this layer is 1 (for regression and binary classification) or equal to number of distinct classes in a multiclass classification task.
After adding the output layer to the network, the next step is to compile the model and then train it on a training dataset.
1.optimizer → Algorithm used for updating the weights of our CNN. “Adam” (an adaptive variant of stochastic gradient descent) is one of the popular optimizers used for updating weights.
2.loss → Cost function used for calculating the error between the predicted & actual value. In our case we will be using “categorical_crossentropy” since we are dealing with multiclass classification. In case of binary classification we have to use “binary_crossentropy” as loss function.
3.metrics → Evaluation metric for checking performance of our model.
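As a small illustration of the loss described above, categorical cross-entropy for a single image is the negative log of the probability the model assigns to the true class. The sketch below uses made-up probabilities for a three-class case; `categorical_crossentropy_one` is a hypothetical helper, not a keras function.

```r
# Cross-entropy for one sample: y_true is one-hot, y_pred holds class probabilities
categorical_crossentropy_one <- function(y_true, y_pred) {
  -sum(y_true * log(y_pred))
}

y_true <- c(0, 1, 0)        # the true class is the second one
y_pred <- c(0.1, 0.7, 0.2)  # made-up predicted probabilities

loss <- categorical_crossentropy_one(y_true, y_pred)  # same value as -log(0.7)

# A more confident correct prediction gives a smaller loss
loss_confident <- categorical_crossentropy_one(y_true, c(0.05, 0.9, 0.05))
```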
1.batch_size → Number of images used to train our CNN model before each weight update via backpropagation.
2.epochs → Number of complete passes over the training images; in each epoch, every training image is used once to update the weights.
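The pieces described above can be put together roughly as follows. This is only a sketch under stated assumptions: the number of convolution blocks, the filter counts (32 and 64), the 3 x 3 kernels, and the softmax output are illustrative guesses and are not read from the saved final_model.h5; `build_cnn_sketch` and `fit_cnn_sketch` are hypothetical helper names. Calls are written with the `keras::` prefix so the functions can be defined without attaching the package.

```r
# Hypothetical helper: layer sizes are assumptions, not the saved model's
build_cnn_sketch <- function(input_shape = c(224, 224, 3), n_classes = 75) {
  model <- keras::keras_model_sequential()

  # Feature learning: convolution + max pooling blocks
  model <- keras::layer_conv_2d(model, filters = 32, kernel_size = c(3, 3),
                                activation = "relu", input_shape = input_shape)
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))
  model <- keras::layer_conv_2d(model, filters = 64, kernel_size = c(3, 3),
                                activation = "relu")
  model <- keras::layer_max_pooling_2d(model, pool_size = c(2, 2))

  # Classification: flatten, dense, dropout, output
  model <- keras::layer_flatten(model)
  model <- keras::layer_dense(model, units = 128, activation = "relu")
  model <- keras::layer_dropout(model, rate = 0.2)
  model <- keras::layer_dense(model, units = n_classes, activation = "softmax")

  # Compile with the optimizer, loss, and metric discussed above
  keras::compile(model,
                 optimizer = "adam",
                 loss = "categorical_crossentropy",
                 metrics = c("accuracy"))
  model
}

# Hypothetical helper: trains on directory generators, one full pass per epoch
fit_cnn_sketch <- function(model, train_generator, validation_generator,
                           n_train = 9285, n_valid = 375,
                           batch_size = 20, epochs = 50) {
  keras::fit_generator(model, train_generator,
                       steps_per_epoch = ceiling(n_train / batch_size),
                       epochs = epochs,
                       validation_data = validation_generator,
                       validation_steps = ceiling(n_valid / batch_size))
}
```

With keras installed, `model <- build_cnn_sketch()` followed by `summary(model)` shows the resulting layer shapes.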
Using image_data_generator, we are re-scaling our images (every pixel value becomes a value between 0 & 1).
train_datagen <- image_data_generator(rescale = 1/255)
validation_datagen <- image_data_generator(rescale = 1/255)
test_datagen <- image_data_generator(rescale = 1/255)
Generating Training, Validation and Testing data
train_generator <- flow_images_from_directory(
directory="train/", # Target directory
generator = train_datagen, # Training data generator
target_size = c(224, 224), # Resizes all images to 224 × 224
batch_size = 20, # 20 samples in one batch
seed = 123,
class_mode = "categorical" # Because we use categorical_crossentropy loss
)
validation_generator <- flow_images_from_directory(
directory="valid/", # Target directory
generator = validation_datagen, # Validation data generator
target_size = c(224, 224), # Resizes all images to 224 × 224
batch_size = 20, # 20 samples in one batch
seed = 123,
class_mode = "categorical" # Because we use categorical_crossentropy loss
)
test_generator <- flow_images_from_directory(
directory="test/", # Target directory
generator = test_datagen, # Test data generator
target_size = c(224, 224), # Resizes all images to 224 × 224
batch_size = 20, # 20 samples in one batch
seed = 123,
class_mode = "categorical" # Because we use categorical_crossentropy loss
)
Loading the already trained model in R (training took around 100 minutes)
model_file = "final_model.h5"
history_file = "model_history.rds"
model_v2 <- load_model_hdf5(model_file)
history_v2 <- read_rds(history_file)
plot(history_v2)
## `geom_smooth()` using formula 'y ~ x'
Original classification values of test data
test_data <- data.frame(file_name = paste0("test/", test_generator$filenames)) %>%
mutate(class = str_extract(file_name, "ADONIS|AFRICAN GIANT SWALLOWTAIL|AMERICAN SNOOT|AN 88|APPOLLO|ATALA|BANDED ORANGE HELICONIAN|BANDED PEACOCK|BECKERS WHITE|BLACK HAIRSTREAK|BLUE MORPHO|BLUE SPOTTED CROW|BROWN SIPROETA|CABBAGE WHITE|CAIRNS BIRDWING|CHECQUERED SKIPPER|CHESTNUT|CLEOPATRA|CLODIUS PARNASSIAN|CLOUDED SULPHUR|COMMON BANDED AWL|COMMON WOOD-NYMPH|COPPER TAIL|CRECENT|CRIMSON PATCH|DANAID EGGFLY|EASTERN COMA|EASTERN DAPPLE WHITE|EASTERN PINE ELFIN|ELBOWED PIERROT|GOLD BANDED|GREAT EGGFLY|GREAT JAY|GREEN CELLED CATTLEHEART|GREY HAIRSTREAK|INDRA SWALLOW|IPHICLUS SISTER|JULIA|LARGE MARBLE|MALACHITE|MANGROVE SKIPPER|MESTRA|METALMARK|MILBERTS TORTOISESHELL|MONARCH|MOURNING CLOAK|ORANGE OAKLEAF|ORANGE TIP|ORCHARD SWALLOW|PAINTED LADY|PAPER KITE|PEACOCK|PINE WHITE|PIPEVINE SWALLOW|POPINJAY|PURPLE HAIRSTREAK|PURPLISH COPPER|QUESTION MARK|RED ADMIRAL|RED CRACKER|RED POSTMAN|RED SPOTTED PURPLE|SCARCE SWALLOW|SILVER SPOT SKIPPER|SLEEPY ORANGE|SOOTYWING|SOUTHERN DOGFACE|STRAITED QUEEN|TROPICAL LEAFWING|TWO BARRED FLASHER|ULYSES|VICEROY|WOOD SATYR|YELLOW SWALLOW TAIL|ZEBRA LONG WING"))
head(test_data, 30)
## file_name class
## 1 test/ADONIS\\1.jpg ADONIS
## 2 test/ADONIS\\2.jpg ADONIS
## 3 test/ADONIS\\3.jpg ADONIS
## 4 test/ADONIS\\4.jpg ADONIS
## 5 test/ADONIS\\5.jpg ADONIS
## 6 test/AFRICAN GIANT SWALLOWTAIL\\1.jpg AFRICAN GIANT SWALLOWTAIL
## 7 test/AFRICAN GIANT SWALLOWTAIL\\2.jpg AFRICAN GIANT SWALLOWTAIL
## 8 test/AFRICAN GIANT SWALLOWTAIL\\3.jpg AFRICAN GIANT SWALLOWTAIL
## 9 test/AFRICAN GIANT SWALLOWTAIL\\4.jpg AFRICAN GIANT SWALLOWTAIL
## 10 test/AFRICAN GIANT SWALLOWTAIL\\5.jpg AFRICAN GIANT SWALLOWTAIL
## 11 test/AMERICAN SNOOT\\1.jpg AMERICAN SNOOT
## 12 test/AMERICAN SNOOT\\2.jpg AMERICAN SNOOT
## 13 test/AMERICAN SNOOT\\3.jpg AMERICAN SNOOT
## 14 test/AMERICAN SNOOT\\4.jpg AMERICAN SNOOT
## 15 test/AMERICAN SNOOT\\5.jpg AMERICAN SNOOT
## 16 test/AN 88\\1.jpg AN 88
## 17 test/AN 88\\2.jpg AN 88
## 18 test/AN 88\\3.jpg AN 88
## 19 test/AN 88\\4.jpg AN 88
## 20 test/AN 88\\5.jpg AN 88
## 21 test/APPOLLO\\1.jpg APPOLLO
## 22 test/APPOLLO\\2.jpg APPOLLO
## 23 test/APPOLLO\\3.jpg APPOLLO
## 24 test/APPOLLO\\4.jpg APPOLLO
## 25 test/APPOLLO\\5.jpg APPOLLO
## 26 test/ATALA\\1.jpg ATALA
## 27 test/ATALA\\2.jpg ATALA
## 28 test/ATALA\\3.jpg ATALA
## 29 test/ATALA\\4.jpg ATALA
## 30 test/ATALA\\5.jpg ATALA
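Because each image sits in a folder named after its species, the class label can also be read from the path itself rather than matched against a list of all 75 names. The following is a minimal sketch; `class_from_path` is a hypothetical helper, and it assumes the species name is always the immediate parent folder of the image file.

```r
# Take the parent folder name of each image path as its class label
class_from_path <- function(paths) {
  # Normalise Windows backslashes so the same code works on any platform
  normalised <- gsub("\\", "/", paths, fixed = TRUE)
  basename(dirname(normalised))
}

example_paths <- c("test/ADONIS\\1.jpg", "test/AN 88/3.jpg")
example_classes <- class_from_path(example_paths)
```

This avoids having to update a long pattern by hand if species folders are added or renamed.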
Predicting the classification class of test data using the built model
folder_list_test <- list.files("test/")
folder_path_test <- paste0("test/", folder_list_test, "/")
# Function to convert image to array
image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224, 224),
                      grayscale = FALSE # FALSE because the images are RGB
    )
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x))) # Add a leading batch dimension
    x / 255 # Rescale pixel values to [0, 1]
  })
  # Stack the per-image arrays into one array along the batch dimension
  do.call(abind::abind, c(arrays, list(along = 1)))
}
test_x <- image_prep(test_data$file_name)
# Check dimension of testing data set
#dim(test_x)
pred_test <- predict_classes(model_v2, test_x)
pred_test
## [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
## [26] 5 42 42 5 5 6 6 6 6 37 7 7 7 7 7 8 4 8 4 38 9 68 9 9 9
## [51] 10 10 32 10 0 11 11 11 11 11 12 20 12 12 12 13 13 13 18 13 14 14 14 32 14
## [76] 23 23 23 15 23 25 16 6 49 16 64 17 64 64 64 18 18 18 18 4 19 19 27 66 66
## [101] 20 20 20 15 20 65 21 21 21 21 22 22 20 56 23 2 23 66 56 23 24 33 24 24 24
## [126] 25 25 25 25 51 26 26 26 26 23 27 13 38 47 29 28 28 63 28 15 36 29 29 29 29
## [151] 30 30 30 30 30 23 42 39 43 11 32 32 32 32 32 33 33 33 33 33 34 34 34 34 61
## [176] 35 35 35 35 35 36 36 36 63 12 37 37 37 37 37 27 38 38 19 47 32 39 39 39 74
## [201] 40 40 20 20 40 41 41 31 66 41 42 42 42 42 42 43 55 43 43 43 44 44 44 44 44
## [226] 45 45 45 65 45 46 46 41 68 46 47 47 47 47 47 4 48 48 48 53 49 49 15 49 49
## [251] 50 50 50 50 50 51 51 6 51 51 52 52 8 55 18 53 53 53 53 33 54 54 54 54 54
## [276] 42 70 55 55 10 56 12 64 58 27 26 57 57 26 26 58 58 49 58 58 59 59 50 24 59
## [301] 33 60 60 2 60 61 61 61 53 61 62 62 62 62 62 63 63 63 63 63 64 19 19 64 64
## [326] 31 45 65 65 65 66 66 66 66 66 71 25 45 37 71 45 68 20 68 56 69 69 69 69 69
## [351] 70 70 42 70 70 44 71 71 71 71 72 72 21 72 72 73 73 35 23 73 25 74 74 74 74
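The class indices printed above are zero-based and correspond to the position of the largest predicted probability for each image. Since predict_classes() was deprecated in later keras releases, the same indices can be derived from a probability matrix by hand. The sketch below uses a small made-up matrix as a stand-in for the output of predict(model_v2, test_x); the numbers are not model output.

```r
# Stand-in probability matrix: one row per image, one column per class
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.1, 0.8,
                  0.3, 0.4, 0.3,
                  0.6, 0.3, 0.1), ncol = 3, byrow = TRUE)

# Column of the largest probability per row, shifted to zero-based indices
pred_idx <- max.col(probs, ties.method = "first") - 1

# Made-up true class indices, used to compute accuracy by direct comparison
true_idx <- c(0, 2, 2, 0)
accuracy <- mean(pred_idx == true_idx)
```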
To make the predictions easier to interpret, we convert the encoded indices into class labels (only the first 10 species are mapped here, so any other index decodes to NA)
decode <- function(x){
case_when(x == 0 ~ "ADONIS",
x == 1 ~ "AFRICAN GIANT SWALLOWTAIL",
x == 2 ~ "AMERICAN SNOOT",
x == 3 ~ "AN 88",
x == 4 ~ "APPOLLO",
x == 5 ~ "ATALA",
x == 6 ~ "BANDED ORANGE HELICONIAN",
x == 7 ~ "BANDED PEACOCK",
x == 8 ~ "BECKERS WHITE",
x == 9 ~ "BLACK HAIRSTREAK"
)
}
pred_test <- sapply(pred_test, decode)
head(pred_test, 10)
## [1] "ADONIS" "ADONIS"
## [3] "ADONIS" "ADONIS"
## [5] "ADONIS" "AFRICAN GIANT SWALLOWTAIL"
## [7] "AFRICAN GIANT SWALLOWTAIL" "AFRICAN GIANT SWALLOWTAIL"
## [9] "AFRICAN GIANT SWALLOWTAIL" "AFRICAN GIANT SWALLOWTAIL"
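To decode all 75 species without writing one case per class, the labels can be looked up from the name-to-index mapping that the directory generator exposes as `class_indices`. The sketch below uses a three-entry stand-in list in place of `test_generator$class_indices`; `decode_all` is a hypothetical helper.

```r
# Stand-in for test_generator$class_indices (class name -> zero-based index)
class_indices <- list("ADONIS" = 0,
                      "AFRICAN GIANT SWALLOWTAIL" = 1,
                      "AMERICAN SNOOT" = 2)

# Map zero-based predicted indices back to class names
decode_all <- function(pred_idx, class_indices) {
  labels <- names(class_indices)[order(unlist(class_indices))]
  labels[pred_idx + 1] # R vectors are one-based
}

decoded <- decode_all(c(0, 2, 1, 0), class_indices)
```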
Evaluating the model accuracy using test data
model_v2 %>% evaluate_generator(test_generator, steps = 50)
## $loss
## [1] 1.751759
##
## $accuracy
## [1] 0.7151515
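One pass over a directory generator takes ceiling(number of images / batch size) batches. The sketch below computes that count for the splits used here; `steps_for_one_pass` is a hypothetical helper. Keras directory generators loop indefinitely, so a larger steps value, such as the 50 used above, draws some test images more than once.

```r
# Number of batches needed to see every image exactly once
steps_for_one_pass <- function(n_images, batch_size) {
  ceiling(n_images / batch_size)
}

test_steps  <- steps_for_one_pass(375, 20)   # 18 full batches plus one of 15
train_steps <- steps_for_one_pass(9285, 20)
```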
Dataset: Downloaded and extracted the dataset from Kaggle.
Data Pre-Processing: Performed re-scaling and generated the train, test, and validation data.
Model Architecture: Built the model using Convolutional, Max Pooling, Flatten, and Dense layers.
Model Fitting: Fitted the model to the training data using the fit_generator function.
Model Evaluation: Evaluated the model using the test data and achieved an accuracy of ~71.5%, which means the model correctly classifies roughly 71 to 72 of every 100 test images.
Initially fitted the model with 10 epochs, which gave low accuracy and high loss.
To improve accuracy, increased the number of epochs to 50, which led to overfitting.
To reduce overfitting, added a dropout layer that randomly sets 20% of its input units to zero during training, plus an extra dense layer with 128 units.
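To illustrate what a dropout rate of 0.2 does during training, here is a minimal base R sketch. `dropout_train` is a hypothetical helper that mimics the usual "inverted dropout" scheme: each unit is zeroed with probability 0.2 and the surviving units are scaled by 1 / (1 - 0.2) so the expected activation is unchanged. At prediction time dropout is switched off.

```r
# Training-time dropout on a vector of activations
dropout_train <- function(activations, rate = 0.2) {
  keep <- rbinom(length(activations), size = 1, prob = 1 - rate)
  activations * keep / (1 - rate)
}

set.seed(123)
activations <- rep(1, 1000)
dropped <- dropout_train(activations, rate = 0.2)

zero_share <- mean(dropped == 0) # share of units that were dropped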
Overall, the image classification model for 75 butterfly species reaches about 71.5% accuracy on the test data.