Group Members

👩 Amira Hanee binti Saifulazri(17202918)
👩 Saidatul Hanida Binti Mohd Yukhi(22082961)
👧 Sarvinnah Kajandren(22051663)
👩 Farzana Syakira binti Bahari(22058163)
👧 Liyujie(17103144)

Title: Crab Age Prediction

Introduction

Crab growth plays a significant role in the seafood industry, particularly when it comes to determine the commercial value of crabs based on essential characteristics such as size, weight, and meat quality.

Additionally, gaining insights into the age of the crab population is crucial for fishers and seafood processors as it enables them to make well-informed decisions regarding harvesting, processing, and marketing strategies.

Objective

To predict the age of crabs based on the physical attributes with regression algorithms
To predict the age group of crabs based on the physical attributes with classification algorithms

Data Preparation

In this section, the processes include identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity.

Import all relevant libraries

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(corrplot)

## corrplot 0.92 loaded

library(e1071)
library(klaR)

## Loading required package: MASS

library(nnet)
library(Metrics)

## 
## Attaching package: 'Metrics'

## The following objects are masked from 'package:caret':
## 
##     precision, recall

library(rpart)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(xgboost)

## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice

library(magrittr)

## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

Load the dataset from .csv file

crabage<- read.csv("CrabAgePrediction.csv")

summary(crabage)

##      Sex                Length          Diameter          Height      
##  Length:3893        Min.   :0.1875   Min.   :0.1375   Min.   :0.0000  
##  Class :character   1st Qu.:1.1250   1st Qu.:0.8750   1st Qu.:0.2875  
##  Mode  :character   Median :1.3625   Median :1.0625   Median :0.3625  
##                     Mean   :1.3113   Mean   :1.0209   Mean   :0.3494  
##                     3rd Qu.:1.5375   3rd Qu.:1.2000   3rd Qu.:0.4125  
##                     Max.   :2.0375   Max.   :1.6250   Max.   :2.8250  
##      Weight        Shucked.Weight     Viscera.Weight      Shell.Weight     
##  Min.   : 0.0567   Min.   : 0.02835   Min.   : 0.01418   Min.   : 0.04252  
##  1st Qu.:12.6722   1st Qu.: 5.34388   1st Qu.: 2.66485   1st Qu.: 3.71378  
##  Median :22.7930   Median : 9.53961   Median : 4.86194   Median : 6.66213  
##  Mean   :23.5673   Mean   :10.20734   Mean   : 5.13655   Mean   : 6.79584  
##  3rd Qu.:32.7862   3rd Qu.:14.27397   3rd Qu.: 7.20077   3rd Qu.: 9.35534  
##  Max.   :80.1015   Max.   :42.18406   Max.   :21.54562   Max.   :28.49125  
##       Age        
##  Min.   : 1.000  
##  1st Qu.: 8.000  
##  Median :10.000  
##  Mean   : 9.955  
##  3rd Qu.:11.000  
##  Max.   :29.000

dim(crabage)

## [1] 3893    9

str(crabage)

## 'data.frame':    3893 obs. of  9 variables:
##  $ Sex           : chr  "F" "M" "I" "F" ...
##  $ Length        : num  1.438 0.887 1.038 1.175 0.887 ...
##  $ Diameter      : num  1.175 0.65 0.775 0.887 0.662 ...
##  $ Height        : num  0.412 0.212 0.25 0.25 0.212 ...
##  $ Weight        : num  24.64 5.4 7.95 13.48 6.9 ...
##  $ Shucked.Weight: num  12.33 2.3 3.23 4.75 3.46 ...
##  $ Viscera.Weight: num  5.58 1.37 1.6 2.28 1.49 ...
##  $ Shell.Weight  : num  6.75 1.56 2.76 5.24 1.7 ...
##  $ Age           : int  9 6 6 10 6 8 15 10 13 7 ...

Check if there is any NA in the dataset

colSums(is.na(crabage))

##            Sex         Length       Diameter         Height         Weight 
##              0              0              0              0              0 
## Shucked.Weight Viscera.Weight   Shell.Weight            Age 
##              0              0              0              0

Check if there is any 0 value in the dataset

colSums(crabage==0)

##            Sex         Length       Diameter         Height         Weight 
##              0              0              0              2              0 
## Shucked.Weight Viscera.Weight   Shell.Weight            Age 
##              0              0              0              0

Check if there is any empty string value in the dataset

colSums(crabage=="")

##            Sex         Length       Diameter         Height         Weight 
##              0              0              0              0              0 
## Shucked.Weight Viscera.Weight   Shell.Weight            Age 
##              0              0              0              0

Replace data in the Sex column from character to string

crabage <- crabage %>%
  mutate(Sex = ifelse(Sex == 'F', 'Female', Sex)) %>%
  mutate(Sex = ifelse(Sex == 'M', 'Male', Sex)) %>%
  mutate(Sex = ifelse(Sex == 'I', 'Indeterminate', Sex)) %>%
  mutate(Sex = as.factor(Sex))

ggplot(crabage, aes(x = Sex, fill = Sex)) +
  geom_bar() +
  labs(x = "Sex", y = "Count") +
  ggtitle("Number of Crab by Sex") +
  theme(plot.title = element_text(hjust = 0.5))

Remove 0 value in “Height”

num_ht_before <- nrow(crabage[crabage$Height == 0, ])
num_tot_before <- nrow(crabage)

crabage <- crabage[crabage$Height > 0, ]

message(
  "The total number of rows with zero height changed from ", num_ht_before,
  " to ", nrow(crabage[crabage$Height == 0, ]),
  " (", num_ht_before - nrow(crabage[crabage$Height == 0, ]), ifelse(num_ht_before - nrow(crabage[crabage$Height == 0, ]) > 1, " rows", " row"), " removed).\n",
  "The total number of rows changed from ", num_tot_before,
  " to ", nrow(crabage),
  " (", num_tot_before - nrow(crabage), ifelse(num_tot_before - nrow(crabage) > 1, " rows", " row"), " removed)."
)

## The total number of rows with zero height changed from 2 to 0 (2 rows removed).
## The total number of rows changed from 3893 to 3891 (2 rows removed).

Grouping the age according to age group

crabage <- crabage %>%
  mutate(AgeGroup=case_when(
    Age < 10 ~ "Young",
    Age >= 10 & Age <= 18 ~ "Adult",
    Age > 18 ~ "Old"
  )) %>%
  mutate(AgeGroup=factor(AgeGroup, levels=c("Young", "Adult", "Old")))

Checking the outliers and remove them from numerical features only

remove_outliers <- function(column) {
  boxplot <- crabage %>%
    dplyr::select(!!column) %>%
    boxplot(main = column, plot = FALSE)
  
  outliers <- boxplot$out
  num_before <- nrow(crabage)
  
  filtered_data <- crabage %>%
    dplyr::filter(!(!!column %in% outliers))
  
  num_after <- nrow(filtered_data)
  diff <- num_before - num_after
  caption <- paste0(diff, " outliers are identified and removed.\n", "The total number of rows changed from ", num_before, " to ", num_after, ".\n\n\n")
  
  print(
    filtered_data %>%
      ggplot(aes(x = !!column)) +
      geom_boxplot(outlier.size = 2 ,color="red", fill="orange") +
      labs(caption = caption) +
      theme_minimal() +
      theme(
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank(),
        plot.caption = element_text(hjust = 0.5, face = "italic", size = 12, color = "red")
      )
  )
  
  return(filtered_data)
}

columns <- c(quo(Length), quo(Diameter), quo(Height), quo(Weight), quo(Shucked.Weight), quo(Viscera.Weight), quo(Shell.Weight))


for (column in columns) {
  filtered_data <- remove_outliers(column)
  crabage <- filtered_data
}

Normalization-To ensure that different features have equal importance during model training

cols <- c("Length","Diameter", "Height","Weight","Shucked.Weight", "Viscera.Weight","Shell.Weight", "Age")
head(crabage[,cols])

##   Length Diameter Height    Weight Shucked.Weight Viscera.Weight Shell.Weight
## 1 1.4375   1.1750 0.4125 24.635715      12.332033       5.584852     6.747181
## 2 0.8875   0.6500 0.2125  5.400580       2.296310       1.374951     1.559222
## 3 1.0375   0.7750 0.2500  7.952035       3.231843       1.601747     2.764076
## 4 1.1750   0.8875 0.2500 13.480187       4.748541       2.282135     5.244657
## 5 0.8875   0.6625 0.2125  6.903103       3.458639       1.488349     1.700970
## 6 1.5500   1.1625 0.3500 28.661344      13.579410       6.761356     7.229122
##   Age
## 1   9
## 2   6
## 3   6
## 4  10
## 5   6
## 6   8

scaler <- preProcess(crabage[, cols], method=c("center", "scale"))
crabage[, cols] <- predict(scaler, crabage[, cols])
head(crabage[,cols]
)

##       Length   Diameter      Height     Weight Shucked.Weight Viscera.Weight
## 1  0.4478178  0.6639802  0.71809842  0.1220275      0.4090012      0.1889611
## 2 -1.5436978 -1.6246994 -1.53303811 -1.3977365     -1.3496324     -1.2820981
## 3 -1.0005572 -1.0797757 -1.11095001 -1.1961466     -1.1856920     -1.2028491
## 4 -0.5026783 -0.5893443 -1.11095001 -0.7593684     -0.9199098     -0.9651022
## 5 -1.5436978 -1.5702070 -1.53303811 -1.2790224     -1.1459488     -1.2424736
## 6  0.8551732  0.6094879  0.01461826  0.4400916      0.6275884      0.6000653
##   Shell.Weight          Age
## 1   0.02359436 -0.312196092
## 2  -1.42650664 -1.275544032
## 3  -1.08973455 -1.275544032
## 4  -0.39638024  0.008919888
## 5  -1.38688639 -1.275544032
## 6   0.15830320 -0.633312072

Rename the features

crabage <- crabage %>%
  rename(`Length(ft)`=Length,`Diameter(ft)`=Diameter, `Height (ft)` = Height, `Weight (oz)` = Weight,`Shucked.Weight (oz)` = Shucked.Weight,`Viscera.Weight(oz)` = Viscera.Weight,`Shell.Weight(oz)` = Shell.Weight,`Age (months)` = Age)

View the new features name

head(crabage)

##             Sex Length(ft) Diameter(ft) Height (ft) Weight (oz)
## 1        Female  0.4478178    0.6639802  0.71809842   0.1220275
## 2          Male -1.5436978   -1.6246994 -1.53303811  -1.3977365
## 3 Indeterminate -1.0005572   -1.0797757 -1.11095001  -1.1961466
## 4        Female -0.5026783   -0.5893443 -1.11095001  -0.7593684
## 5 Indeterminate -1.5436978   -1.5702070 -1.53303811  -1.2790224
## 6        Female  0.8551732    0.6094879  0.01461826   0.4400916
##   Shucked.Weight (oz) Viscera.Weight(oz) Shell.Weight(oz) Age (months) AgeGroup
## 1           0.4090012          0.1889611       0.02359436 -0.312196092    Young
## 2          -1.3496324         -1.2820981      -1.42650664 -1.275544032    Young
## 3          -1.1856920         -1.2028491      -1.08973455 -1.275544032    Young
## 4          -0.9199098         -0.9651022      -0.39638024  0.008919888    Adult
## 5          -1.1459488         -1.2424736      -1.38688639 -1.275544032    Young
## 6           0.6275884          0.6000653       0.15830320 -0.633312072    Young

Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves examining and visualizing the dataset to gain insights, understand data patterns, and identify potential issues. It helps in formulating hypotheses, guiding feature engineering, and informing subsequent data analysis and modeling steps.

glimpse(crabage)

## Rows: 3,744
## Columns: 10
## $ Sex                   <fct> Female, Male, Indeterminate, Female, Indetermina…
## $ `Length(ft)`          <dbl> 0.44781778, -1.54369776, -1.00055716, -0.5026782…
## $ `Diameter(ft)`        <dbl> 0.66398024, -1.62469940, -1.07977567, -0.5893443…
## $ `Height (ft)`         <dbl> 0.71809842, -1.53303811, -1.11095001, -1.1109500…
## $ `Weight (oz)`         <dbl> 0.12202753, -1.39773648, -1.19614656, -0.7593684…
## $ `Shucked.Weight (oz)` <dbl> 0.409001165, -1.349632392, -1.185691975, -0.9199…
## $ `Viscera.Weight(oz)`  <dbl> 0.18896115, -1.28209812, -1.20284913, -0.9651021…
## $ `Shell.Weight(oz)`    <dbl> 0.02359436, -1.42650664, -1.08973455, -0.3963802…
## $ `Age (months)`        <dbl> -0.312196092, -1.275544032, -1.275544032, 0.0089…
## $ AgeGroup              <fct> Young, Young, Young, Adult, Young, Young, Adult,…

Check the correlation between numerical features

data_numeric = select_if(crabage, is.numeric)
data_corr <- cor(data_numeric)
data_corr

##                     Length(ft) Diameter(ft) Height (ft) Weight (oz)
## Length(ft)           1.0000000    0.9852873   0.8899958   0.9396262
## Diameter(ft)         0.9852873    1.0000000   0.8971888   0.9386415
## Height (ft)          0.8899958    0.8971888   1.0000000   0.8917453
## Weight (oz)          0.9396262    0.9386415   0.8917453   1.0000000
## Shucked.Weight (oz)  0.9130860    0.9065021   0.8383080   0.9678464
## Viscera.Weight(oz)   0.9123245    0.9070372   0.8676047   0.9640747
## Shell.Weight(oz)     0.9150325    0.9227671   0.8982060   0.9580735
## Age (months)         0.5150480    0.5369414   0.5762342   0.5145573
##                     Shucked.Weight (oz) Viscera.Weight(oz) Shell.Weight(oz)
## Length(ft)                    0.9130860          0.9123245        0.9150325
## Diameter(ft)                  0.9065021          0.9070372        0.9227671
## Height (ft)                   0.8383080          0.8676047        0.8982060
## Weight (oz)                   0.9678464          0.9640747        0.9580735
## Shucked.Weight (oz)           1.0000000          0.9260240        0.8864775
## Viscera.Weight(oz)            0.9260240          1.0000000        0.9116556
## Shell.Weight(oz)              0.8864775          0.9116556        1.0000000
## Age (months)                  0.3902022          0.4774974        0.6033754
##                     Age (months)
## Length(ft)             0.5150480
## Diameter(ft)           0.5369414
## Height (ft)            0.5762342
## Weight (oz)            0.5145573
## Shucked.Weight (oz)    0.3902022
## Viscera.Weight(oz)     0.4774974
## Shell.Weight(oz)       0.6033754
## Age (months)           1.0000000

Plot the correlation

corrplot(data_corr, method = "circle", type = "upper", tl.col = "black")

Remove the highly correlated features

index <- which(data_corr > 0.95 & data_corr < 1, arr.ind=TRUE) 

cbind.data.frame(var1 = rownames(data_corr)[index[,1]], 
                 var2 = colnames(data_corr)[index[,2]])

##                  var1                var2
## 1        Diameter(ft)          Length(ft)
## 2          Length(ft)        Diameter(ft)
## 3 Shucked.Weight (oz)         Weight (oz)
## 4  Viscera.Weight(oz)         Weight (oz)
## 5    Shell.Weight(oz)         Weight (oz)
## 6         Weight (oz) Shucked.Weight (oz)
## 7         Weight (oz)  Viscera.Weight(oz)
## 8         Weight (oz)    Shell.Weight(oz)

crabage <- crabage %>%
  select(-`Diameter(ft)`, -`Shucked.Weight (oz)`, -`Viscera.Weight(oz)`, -`Shell.Weight(oz)`)
glimpse(crabage)

## Rows: 3,744
## Columns: 6
## $ Sex            <fct> Female, Male, Indeterminate, Female, Indeterminate, Fem…
## $ `Length(ft)`   <dbl> 0.44781778, -1.54369776, -1.00055716, -0.50267827, -1.5…
## $ `Height (ft)`  <dbl> 0.71809842, -1.53303811, -1.11095001, -1.11095001, -1.5…
## $ `Weight (oz)`  <dbl> 0.12202753, -1.39773648, -1.19614656, -0.75936840, -1.2…
## $ `Age (months)` <dbl> -0.312196092, -1.275544032, -1.275544032, 0.008919888, …
## $ AgeGroup       <fct> Young, Young, Young, Adult, Young, Young, Adult, Adult,…

Check the distribution and skewness

layout(matrix(c(1, 2, 3, 4), 2, 2, byrow=TRUE))
with(crabage, hist(`Length(ft)` ,col = "green"))
with(crabage, hist(`Height (ft)`, col = "purple"))
with(crabage, hist(`Weight (oz)` ,col = "red"))
with(crabage, hist(`Age (months)` ,col = "yellow"))

Train-Test data preparation

set.seed(72)
trainpercent <- createDataPartition(crabage$`Age (months)`, p=0.8, list=FALSE)
TrainData <- crabage[trainpercent, ]
TestData <- crabage[-trainpercent, ]

 

message(
  "Initial: ", nrow(crabage), " rows.\n",
  "Train: ", nrow(TrainData), " rows (or ", round((nrow(TrainData)/nrow(crabage))*100, 2), "% of ", nrow(crabage), " rows).\n",
  "Test: ", nrow(TestData), " rows (or ", round((nrow(TestData)/nrow(crabage))*100, 2), "% of ", nrow(crabage), " rows)."
)

## Initial: 3744 rows.
## Train: 2997 rows (or 80.05% of 3744 rows).
## Test: 747 rows (or 19.95% of 3744 rows).

Train <- TrainData[, 1:6]
Reg_train <- TrainData[, "Age (months)"]
Cls_train <- TrainData[, "AgeGroup"]

 

Test <- TestData[-trainpercent, 1:6]
Reg_test <- TestData[-trainpercent, "Age (months)"]
Cls_test <- TestData[-trainpercent, "AgeGroup"]

Data Modelling and Model Evaluation

Objective 1: Predict Age of the Crab (Regression Algorithm)

To predict the age of crabs based on the physical attributes with regression algorithms

Regression algorithms is one type of supervised learning algorithms in machine learning that helps to map a predictive relationship between crab age and other attributes. Regression algorithms are used to answer the first question of this project in which these include linear regression algorithms, support vector regressor and xgboost regressor.

Linear Regression

AP_LR_model <- lm(`Age (months)`~., data=TrainData %>% select(-AgeGroup))
AP_LR_pred <- predict(AP_LR_model, Test)
message(
  "The mean absolute error (MAE) is ", round(mae(trunc(Reg_test), trunc(AP_LR_pred)), 4), ".\n",
  "The mean squared error (MSE) is ", round(mse(trunc(Reg_test), trunc(AP_LR_pred)), 4), ".\n",
  "The root mean squared error (RMSE) is ", round(rmse(trunc(Reg_test), trunc(AP_LR_pred)), 4), "."
)

## The mean absolute error (MAE) is 0.2653.
## The mean squared error (MSE) is 0.4422.
## The root mean squared error (RMSE) is 0.665.

Support Vector Regression

AP_SV_model <- svm(`Age (months)`~., data=TrainData %>% select(-AgeGroup))
AP_SV_pred <- predict(AP_SV_model, Test)
message(
  "The mean absolute error (MAE) is ", round(mae(trunc(Reg_test), trunc(AP_SV_pred)), 4), ".\n",
  "The mean squared error (MSE) is ", round(mse(trunc(Reg_test), trunc(AP_SV_pred)), 4), ".\n",
  "The root mean squared error (RMSE) is ", round(rmse(trunc(Reg_test), trunc(AP_SV_pred)), 4), "."
)

## The mean absolute error (MAE) is 0.2789.
## The mean squared error (MSE) is 0.4558.
## The root mean squared error (RMSE) is 0.6751.

XgBoost

Train_x = data.matrix(TrainData[,-5])
Train_y = TrainData[,5]
Test_x = data.matrix(TestData[, -5])
Test_y = TestData[, 5]
XGB_train = xgb.DMatrix(data = Train_x, label = Train_y)
XGB_test = xgb.DMatrix(data = Test_x, label = Test_y)
watchlist = list(train=XGB_train, test=XGB_test)
final = xgboost(data = XGB_train, max.depth = 3, nrounds = 56, verbose = 0)
pred_y = predict(final, XGB_test)

 

message(
  "The mean absolute error (MAE) is ", round(mae(trunc(Test_y), trunc(pred_y)), 4), ".\n",
  "The mean squared error (MSE) is ", round(mse(trunc(Test_y), trunc(pred_y)), 4), ".\n",
  "The root mean squared error (RMSE) is ", round(rmse(trunc(Test_y), trunc(pred_y)), 4), "."
)

## The mean absolute error (MAE) is 0.1941.
## The mean squared error (MSE) is 0.2262.
## The root mean squared error (RMSE) is 0.4756.

Objective 2: Predict Age Group of the Crab (Classification Algorithm)

To predict the age group of crabs based on the physical attributes with classification algorithms

Classification is a process of categorizing data instances into distinct classes or categories based on their features or attributes. Based on this case, we categorizing the age data into “Young”, “Adult” and “Old”. Then, to achieve the project objective we use Logistic Regression as well as Decision Tree.

Logistic Regression

AG_LogR_model <- multinom(AgeGroup~., data=TrainData %>% select(-`Age (months)`))

## # weights:  21 (12 variable)
## initial  value 3292.541029 
## iter  10 value 1849.006384
## iter  20 value 1825.166904
## final  value 1824.717106 
## converged

AG_LogR_pred <- predict(AG_LogR_model, Test)
confusionMatrix(AG_LogR_pred, Cls_test)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Young Adult Old
##      Young    52    16   1
##      Adult    26    49   3
##      Old       0     0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6871          
##                  95% CI : (0.6055, 0.7609)
##     No Information Rate : 0.5306          
##     P-Value [Acc > NIR] : 8.172e-05       
##                                           
##                   Kappa : 0.3939          
##                                           
##  Mcnemar's Test P-Value : 0.09448         
## 
## Statistics by Class:
## 
##                      Class: Young Class: Adult Class: Old
## Sensitivity                0.6667       0.7538    0.00000
## Specificity                0.7536       0.6463    1.00000
## Pos Pred Value             0.7536       0.6282        NaN
## Neg Pred Value             0.6667       0.7681    0.97279
## Prevalence                 0.5306       0.4422    0.02721
## Detection Rate             0.3537       0.3333    0.00000
## Detection Prevalence       0.4694       0.5306    0.00000
## Balanced Accuracy          0.7101       0.7001    0.50000

Decision Tree

AG_DT_model <- rpart(AgeGroup~., data=TrainData %>% select(-`Age (months)`))
AG_DT_pred <- predict(AG_DT_model, Test)
AG_DT_pred <- as.data.frame(AG_DT_pred) %>%
  mutate(Result=case_when(
    Young>=Adult & Young>=Old ~ "Young",
    Adult>=Young & Adult>=Old ~ "Adult",
    Old>=Young & Old>=Adult ~ "Old"
  )) %>%
  mutate(Result=factor(Result, levels=c("Young", "Adult", "Old"))) %>%
  pull(Result)
confusionMatrix(AG_DT_pred, Cls_test)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Young Adult Old
##      Young    39     9   1
##      Adult    39    56   3
##      Old       0     0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6463          
##                  95% CI : (0.5632, 0.7233)
##     No Information Rate : 0.5306          
##     P-Value [Acc > NIR] : 0.002995        
##                                           
##                   Kappa : 0.3305          
##                                           
##  Mcnemar's Test P-Value : 4.553e-05       
## 
## Statistics by Class:
## 
##                      Class: Young Class: Adult Class: Old
## Sensitivity                0.5000       0.8615    0.00000
## Specificity                0.8551       0.4878    1.00000
## Pos Pred Value             0.7959       0.5714        NaN
## Neg Pred Value             0.6020       0.8163    0.97279
## Prevalence                 0.5306       0.4422    0.02721
## Detection Rate             0.2653       0.3810    0.00000
## Detection Prevalence       0.3333       0.6667    0.00000
## Balanced Accuracy          0.6775       0.6747    0.50000

Conclusion

Regression Model Comparison

Model_Name	Mean.Absolute.Error..MAE.	Mean.Squared.Error..MSE.	Root.Mean.Squared.Error..RMSE.
Linear Regression	0.2653	0.4422	0.6650
Support Vector Regression	0.2789	0.4558	0.6751
XGBoost Regression	0.1941	0.2262	0.4756

Classification Model Comparison

Model_Name	Accuracy
Logistic Regression	0.6871
Decision Tree	0.6463

In conclusion, XGBoost is the best suited regression algorithm compared to other algorithms that have been used in this project. This could be due to its higher efficiency and ability to handle large datasets easily. On the other hand, logistic regression is the best suited classification algorithm compared to decision tree algorithm. The reason for this is because of the relationship between the features and target variable in this dataset is simple and linear. This gives the advantage for logistic regression to outperform decision trees in terms of accurately predicting the age group of the crab. Hence, the objectives of this project are fully achieved.

WQD7004- Group Assignment

FALSS

2023-06-16