👩 Amira Hanee binti Saifulazri(17202918)
👩 Saidatul Hanida Binti Mohd Yukhi(22082961)
👧 Sarvinnah Kajandren(22051663)
👩 Farzana Syakira binti Bahari(22058163)
👧 Liyujie(17103144)
Crab growth plays a significant role in the seafood industry, particularly when it comes to determine the commercial value of crabs based on essential characteristics such as size, weight, and meat quality.
Additionally, gaining insights into the age of the crab population is crucial for fishers and seafood processors as it enables them to make well-informed decisions regarding harvesting, processing, and marketing strategies.
In this section, the processes include identifying the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and then modifying, replacing or deleting them according to the necessity.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(corrplot)
## corrplot 0.92 loaded
library(e1071)
library(klaR)
## Loading required package: MASS
library(nnet)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(rpart)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
crabage<- read.csv("CrabAgePrediction.csv")
summary(crabage)
## Sex Length Diameter Height
## Length:3893 Min. :0.1875 Min. :0.1375 Min. :0.0000
## Class :character 1st Qu.:1.1250 1st Qu.:0.8750 1st Qu.:0.2875
## Mode :character Median :1.3625 Median :1.0625 Median :0.3625
## Mean :1.3113 Mean :1.0209 Mean :0.3494
## 3rd Qu.:1.5375 3rd Qu.:1.2000 3rd Qu.:0.4125
## Max. :2.0375 Max. :1.6250 Max. :2.8250
## Weight Shucked.Weight Viscera.Weight Shell.Weight
## Min. : 0.0567 Min. : 0.02835 Min. : 0.01418 Min. : 0.04252
## 1st Qu.:12.6722 1st Qu.: 5.34388 1st Qu.: 2.66485 1st Qu.: 3.71378
## Median :22.7930 Median : 9.53961 Median : 4.86194 Median : 6.66213
## Mean :23.5673 Mean :10.20734 Mean : 5.13655 Mean : 6.79584
## 3rd Qu.:32.7862 3rd Qu.:14.27397 3rd Qu.: 7.20077 3rd Qu.: 9.35534
## Max. :80.1015 Max. :42.18406 Max. :21.54562 Max. :28.49125
## Age
## Min. : 1.000
## 1st Qu.: 8.000
## Median :10.000
## Mean : 9.955
## 3rd Qu.:11.000
## Max. :29.000
dim(crabage)
## [1] 3893 9
str(crabage)
## 'data.frame': 3893 obs. of 9 variables:
## $ Sex : chr "F" "M" "I" "F" ...
## $ Length : num 1.438 0.887 1.038 1.175 0.887 ...
## $ Diameter : num 1.175 0.65 0.775 0.887 0.662 ...
## $ Height : num 0.412 0.212 0.25 0.25 0.212 ...
## $ Weight : num 24.64 5.4 7.95 13.48 6.9 ...
## $ Shucked.Weight: num 12.33 2.3 3.23 4.75 3.46 ...
## $ Viscera.Weight: num 5.58 1.37 1.6 2.28 1.49 ...
## $ Shell.Weight : num 6.75 1.56 2.76 5.24 1.7 ...
## $ Age : int 9 6 6 10 6 8 15 10 13 7 ...
colSums(is.na(crabage))
## Sex Length Diameter Height Weight
## 0 0 0 0 0
## Shucked.Weight Viscera.Weight Shell.Weight Age
## 0 0 0 0
colSums(crabage==0)
## Sex Length Diameter Height Weight
## 0 0 0 2 0
## Shucked.Weight Viscera.Weight Shell.Weight Age
## 0 0 0 0
colSums(crabage=="")
## Sex Length Diameter Height Weight
## 0 0 0 0 0
## Shucked.Weight Viscera.Weight Shell.Weight Age
## 0 0 0 0
crabage <- crabage %>%
mutate(Sex = ifelse(Sex == 'F', 'Female', Sex)) %>%
mutate(Sex = ifelse(Sex == 'M', 'Male', Sex)) %>%
mutate(Sex = ifelse(Sex == 'I', 'Indeterminate', Sex)) %>%
mutate(Sex = as.factor(Sex))
ggplot(crabage, aes(x = Sex, fill = Sex)) +
geom_bar() +
labs(x = "Sex", y = "Count") +
ggtitle("Number of Crab by Sex") +
theme(plot.title = element_text(hjust = 0.5))
num_ht_before <- nrow(crabage[crabage$Height == 0, ])
num_tot_before <- nrow(crabage)
crabage <- crabage[crabage$Height > 0, ]
message(
"The total number of rows with zero height changed from ", num_ht_before,
" to ", nrow(crabage[crabage$Height == 0, ]),
" (", num_ht_before - nrow(crabage[crabage$Height == 0, ]), ifelse(num_ht_before - nrow(crabage[crabage$Height == 0, ]) > 1, " rows", " row"), " removed).\n",
"The total number of rows changed from ", num_tot_before,
" to ", nrow(crabage),
" (", num_tot_before - nrow(crabage), ifelse(num_tot_before - nrow(crabage) > 1, " rows", " row"), " removed)."
)
## The total number of rows with zero height changed from 2 to 0 (2 rows removed).
## The total number of rows changed from 3893 to 3891 (2 rows removed).
crabage <- crabage %>%
mutate(AgeGroup=case_when(
Age < 10 ~ "Young",
Age >= 10 & Age <= 18 ~ "Adult",
Age > 18 ~ "Old"
)) %>%
mutate(AgeGroup=factor(AgeGroup, levels=c("Young", "Adult", "Old")))
remove_outliers <- function(column) {
boxplot <- crabage %>%
dplyr::select(!!column) %>%
boxplot(main = column, plot = FALSE)
outliers <- boxplot$out
num_before <- nrow(crabage)
filtered_data <- crabage %>%
dplyr::filter(!(!!column %in% outliers))
num_after <- nrow(filtered_data)
diff <- num_before - num_after
caption <- paste0(diff, " outliers are identified and removed.\n", "The total number of rows changed from ", num_before, " to ", num_after, ".\n\n\n")
print(
filtered_data %>%
ggplot(aes(x = !!column)) +
geom_boxplot(outlier.size = 2 ,color="red", fill="orange") +
labs(caption = caption) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
plot.caption = element_text(hjust = 0.5, face = "italic", size = 12, color = "red")
)
)
return(filtered_data)
}
columns <- c(quo(Length), quo(Diameter), quo(Height), quo(Weight), quo(Shucked.Weight), quo(Viscera.Weight), quo(Shell.Weight))
for (column in columns) {
filtered_data <- remove_outliers(column)
crabage <- filtered_data
}
cols <- c("Length","Diameter", "Height","Weight","Shucked.Weight", "Viscera.Weight","Shell.Weight", "Age")
head(crabage[,cols])
## Length Diameter Height Weight Shucked.Weight Viscera.Weight Shell.Weight
## 1 1.4375 1.1750 0.4125 24.635715 12.332033 5.584852 6.747181
## 2 0.8875 0.6500 0.2125 5.400580 2.296310 1.374951 1.559222
## 3 1.0375 0.7750 0.2500 7.952035 3.231843 1.601747 2.764076
## 4 1.1750 0.8875 0.2500 13.480187 4.748541 2.282135 5.244657
## 5 0.8875 0.6625 0.2125 6.903103 3.458639 1.488349 1.700970
## 6 1.5500 1.1625 0.3500 28.661344 13.579410 6.761356 7.229122
## Age
## 1 9
## 2 6
## 3 6
## 4 10
## 5 6
## 6 8
scaler <- preProcess(crabage[, cols], method=c("center", "scale"))
crabage[, cols] <- predict(scaler, crabage[, cols])
head(crabage[,cols]
)
## Length Diameter Height Weight Shucked.Weight Viscera.Weight
## 1 0.4478178 0.6639802 0.71809842 0.1220275 0.4090012 0.1889611
## 2 -1.5436978 -1.6246994 -1.53303811 -1.3977365 -1.3496324 -1.2820981
## 3 -1.0005572 -1.0797757 -1.11095001 -1.1961466 -1.1856920 -1.2028491
## 4 -0.5026783 -0.5893443 -1.11095001 -0.7593684 -0.9199098 -0.9651022
## 5 -1.5436978 -1.5702070 -1.53303811 -1.2790224 -1.1459488 -1.2424736
## 6 0.8551732 0.6094879 0.01461826 0.4400916 0.6275884 0.6000653
## Shell.Weight Age
## 1 0.02359436 -0.312196092
## 2 -1.42650664 -1.275544032
## 3 -1.08973455 -1.275544032
## 4 -0.39638024 0.008919888
## 5 -1.38688639 -1.275544032
## 6 0.15830320 -0.633312072
crabage <- crabage %>%
rename(`Length(ft)`=Length,`Diameter(ft)`=Diameter, `Height (ft)` = Height, `Weight (oz)` = Weight,`Shucked.Weight (oz)` = Shucked.Weight,`Viscera.Weight(oz)` = Viscera.Weight,`Shell.Weight(oz)` = Shell.Weight,`Age (months)` = Age)
head(crabage)
## Sex Length(ft) Diameter(ft) Height (ft) Weight (oz)
## 1 Female 0.4478178 0.6639802 0.71809842 0.1220275
## 2 Male -1.5436978 -1.6246994 -1.53303811 -1.3977365
## 3 Indeterminate -1.0005572 -1.0797757 -1.11095001 -1.1961466
## 4 Female -0.5026783 -0.5893443 -1.11095001 -0.7593684
## 5 Indeterminate -1.5436978 -1.5702070 -1.53303811 -1.2790224
## 6 Female 0.8551732 0.6094879 0.01461826 0.4400916
## Shucked.Weight (oz) Viscera.Weight(oz) Shell.Weight(oz) Age (months) AgeGroup
## 1 0.4090012 0.1889611 0.02359436 -0.312196092 Young
## 2 -1.3496324 -1.2820981 -1.42650664 -1.275544032 Young
## 3 -1.1856920 -1.2028491 -1.08973455 -1.275544032 Young
## 4 -0.9199098 -0.9651022 -0.39638024 0.008919888 Adult
## 5 -1.1459488 -1.2424736 -1.38688639 -1.275544032 Young
## 6 0.6275884 0.6000653 0.15830320 -0.633312072 Young
glimpse(crabage)
## Rows: 3,744
## Columns: 10
## $ Sex <fct> Female, Male, Indeterminate, Female, Indetermina…
## $ `Length(ft)` <dbl> 0.44781778, -1.54369776, -1.00055716, -0.5026782…
## $ `Diameter(ft)` <dbl> 0.66398024, -1.62469940, -1.07977567, -0.5893443…
## $ `Height (ft)` <dbl> 0.71809842, -1.53303811, -1.11095001, -1.1109500…
## $ `Weight (oz)` <dbl> 0.12202753, -1.39773648, -1.19614656, -0.7593684…
## $ `Shucked.Weight (oz)` <dbl> 0.409001165, -1.349632392, -1.185691975, -0.9199…
## $ `Viscera.Weight(oz)` <dbl> 0.18896115, -1.28209812, -1.20284913, -0.9651021…
## $ `Shell.Weight(oz)` <dbl> 0.02359436, -1.42650664, -1.08973455, -0.3963802…
## $ `Age (months)` <dbl> -0.312196092, -1.275544032, -1.275544032, 0.0089…
## $ AgeGroup <fct> Young, Young, Young, Adult, Young, Young, Adult,…
data_numeric = select_if(crabage, is.numeric)
data_corr <- cor(data_numeric)
data_corr
## Length(ft) Diameter(ft) Height (ft) Weight (oz)
## Length(ft) 1.0000000 0.9852873 0.8899958 0.9396262
## Diameter(ft) 0.9852873 1.0000000 0.8971888 0.9386415
## Height (ft) 0.8899958 0.8971888 1.0000000 0.8917453
## Weight (oz) 0.9396262 0.9386415 0.8917453 1.0000000
## Shucked.Weight (oz) 0.9130860 0.9065021 0.8383080 0.9678464
## Viscera.Weight(oz) 0.9123245 0.9070372 0.8676047 0.9640747
## Shell.Weight(oz) 0.9150325 0.9227671 0.8982060 0.9580735
## Age (months) 0.5150480 0.5369414 0.5762342 0.5145573
## Shucked.Weight (oz) Viscera.Weight(oz) Shell.Weight(oz)
## Length(ft) 0.9130860 0.9123245 0.9150325
## Diameter(ft) 0.9065021 0.9070372 0.9227671
## Height (ft) 0.8383080 0.8676047 0.8982060
## Weight (oz) 0.9678464 0.9640747 0.9580735
## Shucked.Weight (oz) 1.0000000 0.9260240 0.8864775
## Viscera.Weight(oz) 0.9260240 1.0000000 0.9116556
## Shell.Weight(oz) 0.8864775 0.9116556 1.0000000
## Age (months) 0.3902022 0.4774974 0.6033754
## Age (months)
## Length(ft) 0.5150480
## Diameter(ft) 0.5369414
## Height (ft) 0.5762342
## Weight (oz) 0.5145573
## Shucked.Weight (oz) 0.3902022
## Viscera.Weight(oz) 0.4774974
## Shell.Weight(oz) 0.6033754
## Age (months) 1.0000000
corrplot(data_corr, method = "circle", type = "upper", tl.col = "black")
layout(matrix(c(1, 2, 3, 4), 2, 2, byrow=TRUE))
with(crabage, hist(`Length(ft)` ,col = "green"))
with(crabage, hist(`Height (ft)`, col = "purple"))
with(crabage, hist(`Weight (oz)` ,col = "red"))
with(crabage, hist(`Age (months)` ,col = "yellow"))
set.seed(72)
trainpercent <- createDataPartition(crabage$`Age (months)`, p=0.8, list=FALSE)
TrainData <- crabage[trainpercent, ]
TestData <- crabage[-trainpercent, ]
message(
"Initial: ", nrow(crabage), " rows.\n",
"Train: ", nrow(TrainData), " rows (or ", round((nrow(TrainData)/nrow(crabage))*100, 2), "% of ", nrow(crabage), " rows).\n",
"Test: ", nrow(TestData), " rows (or ", round((nrow(TestData)/nrow(crabage))*100, 2), "% of ", nrow(crabage), " rows)."
)
## Initial: 3744 rows.
## Train: 2997 rows (or 80.05% of 3744 rows).
## Test: 747 rows (or 19.95% of 3744 rows).
Train <- TrainData[, 1:6]
Reg_train <- TrainData[, "Age (months)"]
Cls_train <- TrainData[, "AgeGroup"]
Test <- TestData[-trainpercent, 1:6]
Reg_test <- TestData[-trainpercent, "Age (months)"]
Cls_test <- TestData[-trainpercent, "AgeGroup"]
To predict the age of crabs based on the physical attributes with regression algorithms
Regression algorithms is one type of supervised learning algorithms in machine learning that helps to map a predictive relationship between crab age and other attributes. Regression algorithms are used to answer the first question of this project in which these include linear regression algorithms, support vector regressor and xgboost regressor.
AP_LR_model <- lm(`Age (months)`~., data=TrainData %>% select(-AgeGroup))
AP_LR_pred <- predict(AP_LR_model, Test)
message(
"The mean absolute error (MAE) is ", round(mae(trunc(Reg_test), trunc(AP_LR_pred)), 4), ".\n",
"The mean squared error (MSE) is ", round(mse(trunc(Reg_test), trunc(AP_LR_pred)), 4), ".\n",
"The root mean squared error (RMSE) is ", round(rmse(trunc(Reg_test), trunc(AP_LR_pred)), 4), "."
)
## The mean absolute error (MAE) is 0.2653.
## The mean squared error (MSE) is 0.4422.
## The root mean squared error (RMSE) is 0.665.
AP_SV_model <- svm(`Age (months)`~., data=TrainData %>% select(-AgeGroup))
AP_SV_pred <- predict(AP_SV_model, Test)
message(
"The mean absolute error (MAE) is ", round(mae(trunc(Reg_test), trunc(AP_SV_pred)), 4), ".\n",
"The mean squared error (MSE) is ", round(mse(trunc(Reg_test), trunc(AP_SV_pred)), 4), ".\n",
"The root mean squared error (RMSE) is ", round(rmse(trunc(Reg_test), trunc(AP_SV_pred)), 4), "."
)
## The mean absolute error (MAE) is 0.2789.
## The mean squared error (MSE) is 0.4558.
## The root mean squared error (RMSE) is 0.6751.
Train_x = data.matrix(TrainData[,-5])
Train_y = TrainData[,5]
Test_x = data.matrix(TestData[, -5])
Test_y = TestData[, 5]
XGB_train = xgb.DMatrix(data = Train_x, label = Train_y)
XGB_test = xgb.DMatrix(data = Test_x, label = Test_y)
watchlist = list(train=XGB_train, test=XGB_test)
final = xgboost(data = XGB_train, max.depth = 3, nrounds = 56, verbose = 0)
pred_y = predict(final, XGB_test)
message(
"The mean absolute error (MAE) is ", round(mae(trunc(Test_y), trunc(pred_y)), 4), ".\n",
"The mean squared error (MSE) is ", round(mse(trunc(Test_y), trunc(pred_y)), 4), ".\n",
"The root mean squared error (RMSE) is ", round(rmse(trunc(Test_y), trunc(pred_y)), 4), "."
)
## The mean absolute error (MAE) is 0.1941.
## The mean squared error (MSE) is 0.2262.
## The root mean squared error (RMSE) is 0.4756.
To predict the age group of crabs based on the physical attributes with classification algorithms
Classification is a process of categorizing data instances into distinct classes or categories based on their features or attributes. Based on this case, we categorizing the age data into “Young”, “Adult” and “Old”. Then, to achieve the project objective we use Logistic Regression as well as Decision Tree.
AG_LogR_model <- multinom(AgeGroup~., data=TrainData %>% select(-`Age (months)`))
## # weights: 21 (12 variable)
## initial value 3292.541029
## iter 10 value 1849.006384
## iter 20 value 1825.166904
## final value 1824.717106
## converged
AG_LogR_pred <- predict(AG_LogR_model, Test)
confusionMatrix(AG_LogR_pred, Cls_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Young Adult Old
## Young 52 16 1
## Adult 26 49 3
## Old 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6871
## 95% CI : (0.6055, 0.7609)
## No Information Rate : 0.5306
## P-Value [Acc > NIR] : 8.172e-05
##
## Kappa : 0.3939
##
## Mcnemar's Test P-Value : 0.09448
##
## Statistics by Class:
##
## Class: Young Class: Adult Class: Old
## Sensitivity 0.6667 0.7538 0.00000
## Specificity 0.7536 0.6463 1.00000
## Pos Pred Value 0.7536 0.6282 NaN
## Neg Pred Value 0.6667 0.7681 0.97279
## Prevalence 0.5306 0.4422 0.02721
## Detection Rate 0.3537 0.3333 0.00000
## Detection Prevalence 0.4694 0.5306 0.00000
## Balanced Accuracy 0.7101 0.7001 0.50000
AG_DT_model <- rpart(AgeGroup~., data=TrainData %>% select(-`Age (months)`))
AG_DT_pred <- predict(AG_DT_model, Test)
AG_DT_pred <- as.data.frame(AG_DT_pred) %>%
mutate(Result=case_when(
Young>=Adult & Young>=Old ~ "Young",
Adult>=Young & Adult>=Old ~ "Adult",
Old>=Young & Old>=Adult ~ "Old"
)) %>%
mutate(Result=factor(Result, levels=c("Young", "Adult", "Old"))) %>%
pull(Result)
confusionMatrix(AG_DT_pred, Cls_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Young Adult Old
## Young 39 9 1
## Adult 39 56 3
## Old 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6463
## 95% CI : (0.5632, 0.7233)
## No Information Rate : 0.5306
## P-Value [Acc > NIR] : 0.002995
##
## Kappa : 0.3305
##
## Mcnemar's Test P-Value : 4.553e-05
##
## Statistics by Class:
##
## Class: Young Class: Adult Class: Old
## Sensitivity 0.5000 0.8615 0.00000
## Specificity 0.8551 0.4878 1.00000
## Pos Pred Value 0.7959 0.5714 NaN
## Neg Pred Value 0.6020 0.8163 0.97279
## Prevalence 0.5306 0.4422 0.02721
## Detection Rate 0.2653 0.3810 0.00000
## Detection Prevalence 0.3333 0.6667 0.00000
## Balanced Accuracy 0.6775 0.6747 0.50000
Regression Model Comparison
| Model_Name | Mean.Absolute.Error..MAE. | Mean.Squared.Error..MSE. | Root.Mean.Squared.Error..RMSE. |
|---|---|---|---|
| Linear Regression | 0.2653 | 0.4422 | 0.6650 |
| Support Vector Regression | 0.2789 | 0.4558 | 0.6751 |
| XGBoost Regression | 0.1941 | 0.2262 | 0.4756 |
Classification Model Comparison
| Model_Name | Accuracy |
|---|---|
| Logistic Regression | 0.6871 |
| Decision Tree | 0.6463 |
In conclusion, XGBoost is the best suited regression algorithm compared to other algorithms that have been used in this project. This could be due to its higher efficiency and ability to handle large datasets easily. On the other hand, logistic regression is the best suited classification algorithm compared to decision tree algorithm. The reason for this is because of the relationship between the features and target variable in this dataset is simple and linear. This gives the advantage for logistic regression to outperform decision trees in terms of accurately predicting the age group of the crab. Hence, the objectives of this project are fully achieved.