AUTHENTIC MACHINE LEARNING SUBMISSIONS

knitr::opts_chunk$set(echo = TRUE)
options(repos = c(CRAN = "https://cloud.r-project.org/"))

ABOUT THE DATASET

ABOUT THE DATASET My data set contains 293 images divided into three different categories which are plants,animals and fruits. #LOADING REQUIRED LIBRARIES The libraries used in the overall algorithms include: • Imagers are used for image processing and analysis, including picture loading, editing, and filtering. It is capable of doing operations like cropping, resizing, and improving photos. • Dplyr provides data manipulation utilities such as filtering, selecting, and modifying data frames. It makes data handling easier by using easy pipeline syntax. • Tidyverse is a set of R programs specifically developed for data science. It combines tools for importing, cleaning, displaying, and modeling data, increasing workflow efficiency. • BiocManager::EBImage is used to manage and analyze picture data in biological environments. It allows for tasks such as segmentation, feature extraction, and picture visualization in bioinformatics applications. • Caret improves machine learning by speeding model training, tweaking, and validation. It includes routines for data preparation, model fitting, and performance evaluation. • Nnet is a software for fitting feed-forward neural networks. It can do classification and regression tasks by training neural network models. • Cluster: Supports clustering methods such as k-means and hierarchical clustering, among others. It is useful for unsupervised learning, which involves recognizing data groups or patterns.

library(imager)

## Loading required package: magrittr

## 
## Attaching package: 'imager'

## The following object is masked from 'package:magrittr':
## 
##     add

## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum

## The following object is masked from 'package:graphics':
## 
##     frame

## The following object is masked from 'package:base':
## 
##     save.image

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:imager':
## 
##     where

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ imager::add()       masks magrittr::add()
## ✖ stringr::boundary() masks imager::boundary()
## ✖ tidyr::extract()    masks magrittr::extract()
## ✖ tidyr::fill()       masks imager::fill()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ purrr::set_names()  masks magrittr::set_names()
## ✖ dplyr::where()      masks imager::where()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

install.packages("BiocManager")

## 
## The downloaded binary packages are in
##  /var/folders/y_/0h741v3x4gj8bhh5rvch7plm0000gn/T//Rtmp8XIFag/downloaded_packages

BiocManager::install("EBImage")

## 'getOption("repos")' replaces Bioconductor standard repositories, see
## 'help("repositories", package = "BiocManager")' for details.
## Replacement repositories:
##     CRAN: https://cloud.r-project.org/
## Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

## Warning: package(s) not installed when version(s) same as or greater than current; use
##   `force = TRUE` to re-install: 'EBImage'

## Old packages: 'corrplot', 'data.table', 'dendextend', 'doBy', 'emmeans',
##   'evaluate', 'gtable', 'igraph', 'Matrix', 'quantreg', 'rstudioapi', 'slider'

##LOADING THE DATASET The image_path defines the path of the images/data in our computer and the categories and folders are used to define the label names

# Set the path to your dataset
image_path <- "/Users/lavasai/Desktop/R-ASSIGNMENT DATA"

# Get a list of all categories (folders)
categories <- list.dirs(image_path, recursive = FALSE)
labels <- basename(categories)  # Use folder names as labels

# Function to resize images
resize_image <- function(img_path, img_size = 32) {
  img <- load.image(img_path)  # Load the image
  img <- resize(img, img_size, img_size)  # Resize to specified dimensions
  as.numeric(img)  # Convert to a numeric vector
}

# Function to load all images and create a dataset
load_images <- function(image_path, categories) {
  image_data <- data.frame()  # Initialize an empty data frame
  
  for (category in categories) {
    label <- basename(category)  # Get the label from the folder name
    image_files <- list.files(category, full.names = TRUE)  # List all image files in the category
    
    for (image_file in image_files) {
      img_vector <- resize_image(image_file)  # Resize the image and convert to a vector
      image_data <- rbind(image_data, data.frame(label = label, img_vector = I(list(img_vector))))  # Add to data frame
    }
  }
  
  return(image_data)  # Return the populated data frame
}

# Load the images into a data frame
image_data <- load_images(image_path, categories)

# Check the dimensions of the image data
dim(image_data)  # Should show the number of loaded images and columns

## [1] 293   2

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

#UNDERSTANDING OUR DATASET UNDERSTANDING OUR DATASET • Here’s a quick overview of each step: Label Distribution Check: Use table(image_data_flat\(label) to look for imbalances in the label column, since one class may dominate and cause convergence difficulties during model training. • Visualize Feature Relationships: Use scatter plots or pairs (pairs(image_data_flat[, c(2:5)], col = image_data_flat\)label)) to see how features and labels interact, which can aid in identifying patterns or class separation. • To remove strongly associated characteristics, calculate the correlation matrix (cor_matrix <- cor(image_data_flat[-1])) and identify them. Then, using findCorrelation(), eliminate features with correlations over a specified threshold (e.g., 0.9) to avoid multicollinearity, which might skew the model. • PCA for Dimensionality Reduction: After eliminating strongly correlated features, use prcomp() to further decrease dimensionality. Choose enough components to keep the majority of the variance (e.g., 95%), decreasing the feature space while maintaining the most essential data. • Logistic Regression using PCA Components: Create a logistic regression model (glm()) with the principle components from PCA. This model predicts the label using the reduced feature set, resulting in a more compact and effective representation of the dataset.

table(image_data$label)

## 
## animals  fruits  plants 
##     105     105      83

# Check the structure of the data
str(image_data)

## 'data.frame':    293 obs. of  2 variables:
##  $ label     : chr  "animals" "animals" "animals" "animals" ...
##  $ img_vector:List of 293
##   ..$ : num  0.71 0.71 0.71 0.71 0.71 ...
##   ..$ : num  0.616 0.616 0.62 0.624 0.639 ...
##   ..$ : num  0.1294 0.1216 0.2824 0.0745 0.3451 ...
##   ..$ : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..$ : num  0.051 0.0824 0.0549 0.0549 0.0745 ...
##   ..$ : num  0.482 0.49 0.486 0.49 0.49 ...
##   ..$ : num  0.4 0.424 0.455 0.533 0.557 ...
##   ..$ : num  0.765 0.8 0.792 0.749 0.773 ...
##   ..$ : num  1 1 1 1 0.961 ...
##   ..$ : num  0.404 0.267 0.145 0.243 0.204 ...
##   ..$ : num  0.737 0.537 0.792 0.58 0.678 ...
##   ..$ : num  0.514 0.318 0.506 0.682 0.941 ...
##   ..$ : num  0.925 0.886 0.882 0.706 0.733 ...
##   ..$ : num  0.655 0.655 0.663 0.659 0.655 ...
##   ..$ : num  0.188 0.408 0.502 0.153 0.302 ...
##   ..$ : num  0.859 0.878 0.902 0.918 0.898 ...
##   ..$ : num  0.765 0.808 0.847 0.875 0.882 ...
##   ..$ : num  0.2 0.427 0.388 0.831 0.973 ...
##   ..$ : num  0.706 0.71 0.722 0.733 0.737 ...
##   ..$ : num  0.396 0.404 0.451 0.439 0.431 ...
##   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..$ : num  0.843 0.949 0.973 0.949 0.957 ...
##   ..$ : num  0.675 0.706 0.722 0.733 0.749 ...
##   ..$ : num  0.3922 0.0863 0.1098 0.0392 0.1882 ...
##   ..$ : num  0.671 0.624 0.447 0.529 1 ...
##   ..$ : num  0.914 0.925 0.878 0.875 0.894 ...
##   ..$ : num  0.463 0.533 0.557 0.58 0.706 ...
##   ..$ : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..$ : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..$ : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..$ : num  0.612 0.643 0.655 0.678 0.698 ...
##   ..$ : num  0.706 0.533 0.741 0.533 0.694 ...
##   ..$ : num  0.663 0.651 0.616 0.608 0.58 ...
##   ..$ : num  0.2471 0.0196 0.0431 0.0431 0.0314 ...
##   ..$ : num  0.769 0.808 0.82 0.788 0.765 ...
##   ..$ : num  0.804 0.82 0.824 0.82 0.816 ...
##   ..$ : num  0.0902 0.1451 0.1961 0.2275 0.2431 ...
##   ..$ : num  0.729 0.733 0.729 0.741 0.729 ...
##   ..$ : num  0.953 0.953 0.953 0.953 0.953 ...
##   ..$ : num  0.431 0.396 0.714 0.769 0.678 ...
##   ..$ : num  0.51 0.494 0.525 0.553 0.553 ...
##   ..$ : num  0.475 0.157 0.176 0.141 0.141 ...
##   ..$ : num  0.165 0.145 0.169 0.247 0.2 ...
##   ..$ : num  0.659 0.639 0.616 0.576 0.561 ...
##   ..$ : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..$ : num  0.451 0.486 0.616 0.482 0.604 ...
##   ..$ : num  0.639 0.545 0.639 0.6 0.718 ...
##   ..$ : num  0.369 0.275 0.267 0.208 0.255 ...
##   ..$ : num  0.514 0.529 0.557 0.561 0.533 ...
##   ..$ : num  0.482 0.506 0.518 0.482 0.467 ...
##   ..$ : num  0.792 0.776 0.796 0.824 0.824 ...
##   ..$ : num  0.671 0.667 0.651 0.647 0.627 ...
##   ..$ : num  0.482 0.506 0.518 0.482 0.467 ...
##   ..$ : num  0.333 0.761 0.655 0.753 0.42 ...
##   ..$ : num  0.886 0.957 0.863 0.894 0.792 ...
##   ..$ : num  0.333 0.263 0.384 0.365 0.275 ...
##   ..$ : num  0.0314 0.0706 0.0784 0.098 0.0745 ...
##   ..$ : num  0.522 0.514 0.51 0.514 0.514 ...
##   ..$ : num  0.808 0.808 0.796 0.812 0.804 ...
##   ..$ : num  0.482 0.533 0.565 0.576 0.596 ...
##   ..$ : num  0.929 0.929 0.925 0.925 0.922 ...
##   ..$ : num  0.141 0.122 0.341 0.322 0.353 ...
##   ..$ : num  0.329 0.365 0.227 0.255 0.498 ...
##   ..$ : num  0.0784 0.0902 0.0863 0.0392 0.1765 ...
##   ..$ : num  0.941 0.914 0.839 0.875 0.957 ...
##   ..$ : num  0.0118 0.0902 0.051 0.1961 0.2353 ...
##   ..$ : num  0.655 0.635 0.698 0.659 0.694 ...
##   ..$ : num  0.263 0.478 0.337 0.294 0.345 ...
##   ..$ : num  0.957 0.957 0.941 0.941 0.941 ...
##   ..$ : num  0.1216 0.0745 0.0627 0.098 0.1294 ...
##   ..$ : num  0.051 0.0745 0.1961 0.1686 0 ...
##   ..$ : num  0.0706 0.051 0.0667 0.0667 0.0549 ...
##   ..$ : num  0.769 0.694 0.757 0.608 0.506 ...
##   ..$ : num  0.62 0.647 0.655 0.588 0.596 ...
##   ..$ : num  0.482 0.498 0.494 0.486 0.475 ...
##   ..$ : num  0.643 0.635 0.624 0.541 0.729 ...
##   ..$ : num  0.349 0.349 0.349 0.349 0.349 ...
##   ..$ : num  0.51 0.553 0.537 0.569 0.557 ...
##   ..$ : num  0.922 0.431 0.467 0.329 0.435 ...
##   ..$ : num  0.294 0.318 0.322 0.345 0.345 ...
##   ..$ : num  0.996 0.996 0.996 0.996 0.996 ...
##   ..$ : num  0.678 0.655 0.678 0.702 0.804 ...
##   ..$ : num  0.294 0.322 0.306 0.329 0.322 ...
##   ..$ : num  0.612 0.537 0.62 0.635 0.671 ...
##   ..$ : num  0.231 0.337 0.357 0.345 0.329 ...
##   ..$ : num  0.259 0.286 0.298 0.306 0.286 ...
##   ..$ : num  0.776 0.776 0.78 0.784 0.784 ...
##   ..$ : num  0.859 0.843 0.82 0.816 0.824 ...
##   ..$ : num  0.145 0.18 0.278 0.325 0.412 ...
##   ..$ : num  0.631 0.565 0.557 0.682 0.8 ...
##   ..$ : num  0.42 0.694 0.788 0.82 0.847 ...
##   ..$ : num  0.5804 0.0941 0.0863 0.1255 0.1451 ...
##   ..$ : num  0.518 0.314 0.318 0.306 0.282 ...
##   ..$ : num  0.471 0.412 0.404 0.373 0.376 ...
##   ..$ : num  0.984 0.984 0.984 0.984 0.984 ...
##   ..$ : num  0.259 0.263 0.259 0.267 0.294 ...
##   ..$ : num  0.212 0.29 0.278 0.243 0.231 ...
##   ..$ : num  0.392 0.365 0.376 0.404 0.435 ...
##   ..$ : num  0.22 0.192 0.247 0.259 0.235 ...
##   .. [list output truncated]
##   ..- attr(*, "class")= chr "AsIs"

# Display the names of the columns
colnames(image_data)

## [1] "label"      "img_vector"

# Print the first few rows of the dataset
head(image_data)

##     label   img_vector
## 1 animals 0.709803....
## 2 animals 0.615686....
## 3 animals 0.129411....
## 4 animals 1, 1, 1,....
## 5 animals 0.050980....
## 6 animals 0.482352....

#LOGISTIC REGRESSION

# Load necessary libraries
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

install.packages("nnet")

## 
## The downloaded binary packages are in
##  /var/folders/y_/0h741v3x4gj8bhh5rvch7plm0000gn/T//Rtmp8XIFag/downloaded_packages

library(nnet)

# Flatten the img_vector column into multiple pixel columns
img_matrix <- do.call(rbind, lapply(image_data$img_vector, as.vector))

# Create a new data frame with flattened image data
image_data_flat <- data.frame(label = image_data$label, img_matrix)

# Check the new dimensions
print(dim(image_data_flat))  # Should show the number of loaded images and the number of pixel columns

## [1]  293 3073

# Convert labels to a factor if they are not already
image_data_flat$label <- as.factor(image_data_flat$label)

# Normalize the features
image_data_scaled <- scale(image_data_flat[, -which(names(image_data_flat) == "label")])

# Perform PCA
pca_result <- prcomp(image_data_scaled, center = TRUE, scale. = TRUE)

# Decide the number of components to keep (e.g., 95% variance)
explained_variance <- summary(pca_result)$importance[3,]
num_components <- min(which(cumsum(explained_variance) >= 0.95))

# Create a new dataset with the PCA components
image_data_pca <- data.frame(pca_result$x[, 1:num_components])
image_data_pca$label <- image_data_flat$label

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(image_data_pca$label, p = 0.7, list = FALSE)
train_data <- image_data_pca[train_index, ]
test_data <- image_data_pca[-train_index, ]

# Fit the logistic regression model on training data
logistic_model <- multinom(label ~ ., data = train_data)  # Using multinom for logistic regression in multi-class case

## # weights:  15 (8 variable)
## initial  value 227.412744 
## iter  10 value 213.897313
## final  value 212.041051 
## converged

# Check the summary of the model
summary(logistic_model)

## Call:
## multinom(formula = label ~ ., data = train_data)
## 
## Coefficients:
##        (Intercept)          PC1         PC2         PC3
## fruits -0.04776802 -0.013841894 0.012447930 -0.07559319
## plants -0.14813573 -0.006387537 0.004781214 -0.03860267
## 
## Std. Errors:
##        (Intercept)         PC1        PC2        PC3
## fruits   0.1809852 0.005444998 0.01390418 0.01784523
## plants   0.1811920 0.005420994 0.01401076 0.01741646
## 
## Residual Deviance: 424.0821 
## AIC: 440.0821

# Predict the probabilities for the test dataset
predicted_probs <- predict(logistic_model, newdata = test_data, type = "prob")

# Predict the class labels for the test dataset
predicted_classes <- predict(logistic_model, newdata = test_data)

# View the predicted probabilities and class labels
head(predicted_probs)

##      animals    fruits    plants
## 1  0.2280612 0.4815151 0.2904237
## 2  0.2523976 0.4472858 0.3003166
## 3  0.4636600 0.2383975 0.2979425
## 10 0.4250963 0.2779970 0.2969067
## 11 0.4767388 0.2266579 0.2966033
## 19 0.2813812 0.4095644 0.3090544

head(predicted_classes)

## [1] fruits  fruits  animals animals animals fruits 
## Levels: animals fruits plants

# Create a confusion matrix for the test set
confusion_matrix <- table(Actual = test_data$label, Predicted = predicted_classes)

# View the confusion matrix
print(confusion_matrix)

##          Predicted
## Actual    animals fruits plants
##   animals      20     11      0
##   fruits       13     18      0
##   plants       15      9      0

# Calculate accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Test Accuracy:", accuracy))

## [1] "Test Accuracy: 0.441860465116279"

Interpretation of Logistic Regression: 1) ModelConvergence: After 10 rounds, the logistic regression model converged, suggesting that it had identified an ideal solution during training. 2) Coefficients: The model output provides coefficients for each class (fruits, plants) in comparison to the reference class (animals). These coefficients represent the impact of each main component (PC1, PC2, and PC3) on the likelihood that a picture belongs to a specific class. For example, fruits has a negative coefficient for PC1 (-0.0138), implying that greater PC1 values decrease the chance of a picture being categorized as fruits. 3) Residual Deviance and AIC: The ultimate residual deviation is 424.08, whereas the Akaike Information Criterion (AIC) is 440.08. Lower AIC values suggest higher model performance, although this value alone is meaningless until compared to other models. 4) Predicted probabilities: The projected probabilities for the test set indicate how likely each picture is to be categorized as fruits or animals. Predicted classes: The anticipated class labels for the test set indicate that the first image was categorized as fruits, the second as fruits, and so on. 5) Confusion Matrix: The confusion matrix shows that the model accurately categorized 20 photos as animals and 18 as fruits. However, the model struggled with the plant class, since none of the test photos were accurately identified as plants. Many plants were mistakenly identified as animals or fruits. 6) Test accuracy: The model’s total accuracy on the test set is 44.19%, indicating that it correctly identified the class of about 44% of the photos in the test set. This comparatively low accuracy indicates that the model has space for development.

#CLUSTERING

##Is the data clusterable

# Install and load the 'clustertend' package if you haven't already
install.packages("clustertend")

## 
## The downloaded binary packages are in
##  /var/folders/y_/0h741v3x4gj8bhh5rvch7plm0000gn/T//Rtmp8XIFag/downloaded_packages

library(clustertend)

## Package `clustertend` is deprecated.  Use package `hopkins` instead.

# Compute Hopkins statistic
set.seed(123)  # Set a seed for reproducibility
hopkins_stat <- hopkins(image_data_pca[, -which(names(image_data_pca) == "label")], n = nrow(image_data_pca) - 1)

## Warning in hopkins(image_data_pca[, -which(names(image_data_pca) == "label")],
## : Package `clustertend` is deprecated.  Use package `hopkins` instead.

print(paste("Hopkins Statistic: ", hopkins_stat))

## [1] "Hopkins Statistic:  0.344555313426453"

##HEIRARCHIAL CLUSTERING WITH DENDROGRAM:
# Compute the distance matrix
dist_matrix <- dist(image_data_pca[, -which(names(image_data_pca) == "label")])

# Perform hierarchical clustering
hclust_result <- hclust(dist_matrix)

# Plot the dendrogram
plot(hclust_result, labels = FALSE, main = "Hierarchical Clustering Dendrogram")

# Cut the dendrogram to create clusters
image_data_pca$hclust_cluster <- cutree(hclust_result, k = 3)  # Use the same number of clusters as above

# View the first few rows with cluster assignments
head(image_data_pca)

##          PC1        PC2       PC3   label hclust_cluster
## 1 -10.866958   5.560921 -7.612465 animals              1
## 2 -13.657629  -9.316067 -7.234492 animals              1
## 3  20.426505  -4.620733  3.666783 animals              2
## 4 -52.078210  -8.174305 30.292747 animals              3
## 5  33.094682 -24.521503 22.269496 animals              2
## 6   3.957336  -3.638699 -4.801238 animals              1

# Set the number of clusters (k)
set.seed(123)  # For reproducibility
k <- 4  # You can also specify a different number of clusters

str(image_data_pca)# Assuming kmeans has been applied, e.g., using kmeans_result$cluster

## 'data.frame':    293 obs. of  5 variables:
##  $ PC1           : num  -10.9 -13.7 20.4 -52.1 33.1 ...
##  $ PC2           : num  5.56 -9.32 -4.62 -8.17 -24.52 ...
##  $ PC3           : num  -7.61 -7.23 3.67 30.29 22.27 ...
##  $ label         : Factor w/ 3 levels "animals","fruits",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hclust_cluster: int  1 1 2 3 2 1 1 1 3 3 ...

kmeans_result <- kmeans(image_data_pca[,1:2], centers = 3) # Example with 3 clusters
image_data_pca$cluster <- as.factor(kmeans_result$cluster)  # Add cluster as a factor

# View the first few rows with cluster assignments
head(image_data_pca)

##          PC1        PC2       PC3   label hclust_cluster cluster
## 1 -10.866958   5.560921 -7.612465 animals              1       3
## 2 -13.657629  -9.316067 -7.234492 animals              1       3
## 3  20.426505  -4.620733  3.666783 animals              2       1
## 4 -52.078210  -8.174305 30.292747 animals              3       2
## 5  33.094682 -24.521503 22.269496 animals              2       1
## 6   3.957336  -3.638699 -4.801238 animals              1       3

# Optional: Visualize the clusters in a 2D plot (using the first two PCA components)
library(ggplot2)

ggplot(image_data_pca, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6) +
  labs(title = "K-means Clustering of Image Data",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_minimal()

  # Check the structure of the dataset

Interpreting the Results of Clustering: 1) 2) 3) Hopkins’ Statistic: The calculated Hopkins statistic will return a numerical value representing the dataset’s clustering tendency. If the number is much more than 0.5, it indicates that the data has a strong clustering tendency and that clustering approaches are acceptable. If it’s near to 0.5, the data is equally distributed with no discernible clusters, suggesting that clustering may be ineffective. Hierarchical Clustering Dendrograms: The dendrogram depicts the results of hierarchical clustering, demonstrating how data points are grouped at different degrees of similarity. Cutting the dendrogram at k = 3 produces three separate clusters, as seen in the cluster assignments given to image_data_pca. Cluster Assignments: The first few rows of image_data_pca provide the cluster assignments for hierarchical clustering, showing which data points belong to each of the three clusters. 4) K-Means Clustering: K-means clustering with k = 3 was used, resulting in a new column cluster in image_data_pca that allocates each data point to one of the three groups. The clustering findings may be evaluated by observing the distribution of data points across the clusters. 5) Visualization: The scatter plot made using ggplot2 shows the first two principal components (PC1 and PC2), with points colored according on their cluster assignment. This graphic aids in illustrating the separation of clusters. If clusters are well-separated, it suggests that the K- means algorithm successfully grouped comparable data points together. If there is a lot of overlap, it might mean that the clustering has to be adjusted or that the data’s natural group structures aren’t well-defined. Overall, the results shed light on the clustering structure of the picture data, demonstrating if it organically creates discrete groups. ## #NEURAL NETWORKS

# Load necessary libraries
library(nnet)
library(caret)

# Set seed for reproducibility
set.seed(123)

# Function to fit a neural network model and return the accuracy
fit_nn_model <- function(hidden_neurons, train_data) {
  # Fit the neural network model
  nn_model <- nnet(label ~ ., data = train_data, size = hidden_neurons, maxit = 100, trace = FALSE)
  
  # Predict the class labels
  predicted_classes <- predict(nn_model, train_data, type = "class")
  
  # Create a confusion matrix
  confusion_matrix <- table(Actual = train_data$label, Predicted = predicted_classes)
  
  # Calculate accuracy
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  
  return(accuracy)
}


# Fit models with different hidden layer sizes
neurons_list <- c(5, 10, 20)
accuracy_results <- data.frame(Hidden_Neurons = neurons_list, Accuracy = NA)

for (neurons in neurons_list) {
  accuracy <- fit_nn_model(neurons, image_data_pca)
  accuracy_results[accuracy_results$Hidden_Neurons == neurons, "Accuracy"] <- accuracy
}

# Print the results
print(accuracy_results)

##   Hidden_Neurons  Accuracy
## 1              5 0.5870307
## 2             10 0.6689420
## 3             20 0.7781570

# Printing the neurons list
print(neurons_list)

## [1]  5 10 20

Interpretation of Neural Networks: 1) 5 Hidden Neurons: The model has an accuracy of 58.70%. This shows that when the neural network had a simpler design, it was able to properly categorize around 58.7% of training data. 2) 10 Hidden Neurons: The accuracy increased to 66.89%. This shows that increasing the number of hidden neurons enabled the model to detect more complicated correlations in the data, resulting in improved performance. 3) 20 Hidden Neurons: The greatest recorded accuracy was 77.82%. This demonstrates that adding additional complexity to the model (more neurons) improved performance substantially, accurately categorizing roughly 78.2% of the training examples

AUTHENTIC MACHINE LEARNING SUBMISSIONS

LAVANTHI SUPRAJA

2024-10-26

#LOGISTIC REGRESSION

#CLUSTERING

##Is the data clusterable