This document assesses the profitability and sustainability of harvesting abalone using various statistical techniques.
# stratified train val test
#standardize on train
# why or why not PCA on train
# why we do PCA, why it reduces multicollinearity
#methods - SVM, Multiclass logistic regression (no need for multivariate as 1 response )
# crossvalidation talk about
#predict - logistic and SVM
#non PCA predict - logistics - SVM
#hyper para tune with VAL , cross val
#analysis of results
# interpretation of model coefficients
# note that standardized data has diff inter...
# precision recall what they mean for this scenario , which is more important in sustainability perspective.
# talk about normality how it affects interpretation but not prediction
# why you didnt do certain things - manova , LDA
# talk about why you picked these prediction models
#Residual analysis
#conclusion
# train test val
# note - predictors are different from q1
# PCA here or not (we need interpretation? or we dont?)
# multivariate regression model mlm() for shucked and visceral
# obtain coefficients
# create function for price index
# input is new x vector and prices
# using beta coefficients to form equation for price
# for prediction interval see james answer of ed and week3 2831 properties
# analysis of results
# make graph of price index at various values
# interpretation
# what do the models tell us # what answer the prompt # future investigations
In this section we shall process the data to be more effective for modelling.
abalone <- read.csv("/Users/dakshmukhra/Desktop/uni/masters/hex5_ZZSC5855_2024/abalone.csv")
abalone <-as.data.frame(abalone)
abalone$Sex <- as.factor(abalone$Sex)
Missing, empty-string and zero values are counted below to detect any nonsensical entries.
colSums(is.na(abalone))
## Sex Length Diameter Height Whole.weight
## 0 0 0 0 0
## Shucked.weight Viscera.weight Shell.weight Rings
## 0 0 0 0
colSums(abalone=="")
## Sex Length Diameter Height Whole.weight
## 0 0 0 0 0
## Shucked.weight Viscera.weight Shell.weight Rings
## 0 0 0 0
colSums(abalone==0)
## Sex Length Diameter Height Whole.weight
## 0 0 0 2 0
## Shucked.weight Viscera.weight Shell.weight Rings
## 0 0 0 0
We can see that Height has two zero values, which we shall remove. Inspecting the given CSV sorted by Height also reveals two extreme outliers (heights of 226 and 103). Removing these outliers makes the distributions better behaved.
abalone <- abalone[abalone$Height>0, ]
abalone = abalone[!(abalone$Height == 226),]
abalone = abalone[!(abalone$Height == 103),]
Our data is now ready for some exploration. We shall examine the distributions and employ various statistical tests to gain a better understanding of the data.
par(mfrow = c(3, 3))
hist(abalone$Length,
xlab = "Length", # axis label must be a string, not the data vector
main = "Histogram of length" ,
breaks = sqrt(length(abalone$Length)) # set number of bins
)
hist(abalone$Height,
xlab = "Height",
main = "Histogram of Height" ,
breaks = sqrt(length(abalone$Height)) # set number of bins
)
hist(abalone$Diameter,
xlab = "Diameter",
main = "Histogram of Diameter" ,
breaks = sqrt(length(abalone$Diameter)) # set number of bins
)
hist(abalone$Whole.weight,
xlab = "Whole.weight",
main = "Histogram of Whole.weight" ,
breaks = sqrt(length(abalone$Whole.weight)) # set number of bins
)
hist(abalone$Shucked.weight,
xlab = "Shucked.weight",
main = "Histogram of Shucked.weight" ,
breaks = sqrt(length(abalone$Shucked.weight)) # set number of bins
)
hist(abalone$Viscera.weight,
xlab = "Viscera.weight",
main = "Histogram of Viscera.weight" ,
breaks = sqrt(length(abalone$Viscera.weight)) # set number of bins
)
hist(abalone$Shell.weight,
xlab = "Shell.weight",
main = "Histogram of Shell.weight" ,
breaks = sqrt(length(abalone$Shell.weight)) # set number of bins
)
barplot(table(abalone$Sex),
ylab = "Frequency",
xlab = "Sex")
We can see from the above histograms that the distributions of Length and
Diameter are left skewed, whilst Height is roughly symmetric and resembles a
normal distribution. Shucked weight and Viscera weight are highly right
skewed. The three classes of Sex (infant, female and male) are relatively
evenly represented, hence class imbalance will not be an issue in modelling.
Let's look further into our numeric features for question 1, i.e. Height, Length and Diameter.
numeric_cols <- abalone[c("Height","Length","Diameter")]
ggpairs(numeric_cols)
As we can see from the plot above, the pairwise correlations indicate
that the variables are highly correlated with one another. This is further
emphasised by the scatter plots of the various pairs. Highly correlated
variables make the coefficients of many linear models unstable and harder
to interpret, although predictive performance is usually largely
unaffected. The plot also suggests that the three variables are
non-normal, both marginally and jointly.
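One way to quantify the effect of such correlation on coefficient estimates is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R^2) from a regression on the other predictors. The sketch below uses simulated data with a hypothetical shared "size" factor, not the abalone file, and the helper `vif_by_hand` is an illustrative name rather than part of the analysis.

```r
# Minimal sketch on simulated data (not the abalone file): variance
# inflation factors computed by hand as 1 / (1 - R^2).
set.seed(1)
n <- 500
size <- rnorm(n)                       # hypothetical latent "body size"
sim <- data.frame(
  Length   = size + rnorm(n, sd = 0.2),
  Diameter = size + rnorm(n, sd = 0.2),
  Height   = size + rnorm(n, sd = 0.4)
)

vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    # Regress one predictor on all the others and read off R^2
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

round(cor(sim), 2)   # pairwise correlations
vifs <- vif_by_hand(sim)
vifs                 # values well above 1 signal inflated coefficient variance
```

Applying the same helper to `numeric_cols` would give the corresponding values for the real predictors.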
mvn(numeric_cols, mvnTest="mardia", multivariatePlot="qq")
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 1218.27867529684 1.64308044649984e-255 NO
## 2 Mardia Kurtosis 115.207675661647 0 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling Height 10.3260 <0.001 NO
## 2 Anderson-Darling Length 36.7518 <0.001 NO
## 3 Anderson-Darling Diameter 36.5694 <0.001 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew
## Height 4173 27.85119 7.675505 28 2 50 23 33 -0.2491366
## Length 4173 104.80757 24.012124 109 15 163 90 123 -0.6410833
## Diameter 4173 81.58303 19.842093 85 11 130 70 96 -0.6103689
## Kurtosis
## Height -0.18448134
## Length 0.06529644
## Diameter -0.04515963
We can see from the above statistical tests (Mardia and Anderson-Darling) that normality is rejected both jointly and marginally. The chi-square Q-Q plot leads to the same conclusion, with the data points veering off the diagonal. However, this conclusion of non-normality needs to be taken with a grain of salt. As mentioned in the ZZSC5855 notes:
“as the sample size increases, the Central Limit Theorem tells us that many statistics, including sample means and (much more slowly) sample variances and covariances, approach normality—and multivariate statistics generally approach multivariate normality. This means that regardless of the underlying distribution, the statistical procedures depending on the normality assumption become valid even as the chances that a statistical hypothesis test will detect non-normality there is approaches 1”
Judging by the histograms shown earlier, Height together with a transformed Length and Diameter (shown later on) is close enough to normal to assume multivariate normality, even though the p-values of these statistical tests are small.
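The point made in the quoted notes can be illustrated with a small simulation. The sketch below draws mildly skewed values from a gamma distribution (an arbitrary choice for illustration, unrelated to the abalone data) and applies the Shapiro-Wilk test at two sample sizes. With thousands of observations even this mild skew is detected, whereas a small sample often gives a much larger p-value.

```r
set.seed(2024)
# Mildly skewed data: a gamma with shape 20 has skewness of about 0.45
small <- rgamma(30,   shape = 20)
large <- rgamma(5000, shape = 20)   # shapiro.test accepts at most 5000 values

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

# Same underlying distribution, very different test outcomes
c(n30 = p_small, n5000 = p_large)
```

The data set here has 4173 rows, so a rejection by a formal test says little on its own about whether the departure from normality matters in practice.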
numeric_cols.T <- numeric_cols %>% mutate(
`Height` = (`Height`), # leave height as is
`Length` = Length^2,
`Diameter` = Diameter^2)
keeps <- c("Height","Length","Diameter")
numeric_cols.T = numeric_cols.T[keeps]
abalone.T <- numeric_cols.T
abalone.T$Sex <- abalone$Sex
ggpairs(abalone.T)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, squaring Length and Diameter moved their distributions closer to normal.
Let's split our data into a training and a testing set and scale it. We scale the data because many models, such as support vector machines, work effectively only when their variables are on the same scale. Note that the scaling is fitted on the training set first, and the test set is then scaled using the training mean and standard deviation. This is so that data leakage does not occur.
Also note that principal component analysis (PCA) was explored as a correlation reduction technique; however, prediction results were largely unaffected and the interpretation of the coefficients was lost. Hence PCA has been omitted in the following data manipulation. Please see the appendix for code if required.
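For reference, a leakage-free PCA follows the same pattern as the scaling: the rotation is estimated on the training rows and then applied to the test rows. The sketch below is not the appendix code; it uses simulated data, and the names `sim_train` and `sim_test` are placeholders.

```r
set.seed(42)
n <- 200
size <- rnorm(n)                       # hypothetical shared size factor
sim <- data.frame(
  Height   = size + rnorm(n, sd = 0.4),
  Length   = size + rnorm(n, sd = 0.2),
  Diameter = size + rnorm(n, sd = 0.2)
)
idx <- sample(n, size = round(0.7 * n))
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

# Fit centring, scaling and rotation on the training rows only
pca <- prcomp(sim_train, center = TRUE, scale. = TRUE)

# predict() reuses the training centre, scale and rotation for new rows
train_pcs <- pca$x
test_pcs  <- predict(pca, newdata = sim_test)

summary(pca)$importance["Proportion of Variance", ]
round(cor(train_pcs), 3)   # components are uncorrelated on the training set
```

The components remove the correlation between predictors, but each one is a mixture of Height, Length and Diameter, which is why coefficient interpretation is lost.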
# join transformed data frame
q1data <- abalone.T[c("Sex","Height","Length","Diameter")]
# Split into train and test
# createDataPartition automatically stratifies the sample so that class proportions are preserved in the split
set.seed(42)
train.index <- createDataPartition(q1data$Sex, p = .7, list = FALSE)
train <- q1data[ train.index,]
test <- q1data[-train.index,]
##### NOTE SCALE AFTER SPLIT TO AVOID DATA LEAKAGE
# Scale train
train.scaled <- scale(train[c("Height","Length","Diameter")])
# apply the same mean and std of scaled train to test to avoid data leakage
test.scaled <- scale(test[c("Height","Length","Diameter")], center=attr(train.scaled, "scaled:center"), scale=attr(train.scaled, "scaled:scale"))
train.scaled <- as.data.frame(train.scaled)
test.scaled <- as.data.frame(test.scaled)
train.scaled$Sex <- train$Sex
test.scaled$Sex <- test$Sex
train.scaled$Sex <- relevel(factor(train.scaled$Sex), ref = "F")
test.scaled$Sex <- relevel(factor(test.scaled$Sex), ref = "F")
Now that we have split our data, we are ready to model the relationship between Sex and the variables Length, Height and Diameter. Multinomial logistic regression and a support vector machine (SVM) were chosen for this task, as we need to predict multiple classes of a single dependent variable (Sex). Logistic regression in particular was chosen for the simple interpretation of its coefficients, while an SVM can adapt to non-linear relationships in the data given an appropriate choice of kernel.
We model the following four relationships, reporting a score on both the training and test data for each:
- predicting the sex of the abalone in general;
- predicting specifically Infants as opposed to others (to avoid harvesting them);
- predicting specifically Females as opposed to others (when profitability is prioritised);
- predicting specifically Males as opposed to others (when sustainability is prioritised).
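The last three tasks are one-vs-rest problems: the three-level Sex factor is collapsed into a target class and everything else. A minimal sketch of that relabelling is shown below. The helper `one_vs_rest` and its "Not I" style labels are hypothetical and shown on a toy factor.

```r
# Hypothetical helper: collapse a three-level Sex factor into target vs rest
one_vs_rest <- function(sex, target) {
  stopifnot(target %in% levels(sex))
  rest <- paste("Not", target)
  # Put the "rest" label first so it acts as the reference level
  factor(ifelse(sex == target, target, rest), levels = c(rest, target))
}

sex <- factor(c("F", "I", "M", "I", "F", "M", "M"))
infant_vs_rest <- one_vs_rest(sex, "I")

# Cross-tabulate to confirm that F and M were merged and I was kept
table(sex, infant_vs_rest)
```

Keeping the "rest" label as the first level means a binomial `glm` models the probability of the target class.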
################## LOGISTIC -- TRAIN IS SCALED
# we can see that the train data is not linearly separable
colors <- c("#999999", "#E69F00", "#56B4E9")
colors <- colors[as.numeric(train.scaled$Sex)]
s3d <- scatterplot3d(train.scaled[c("Length","Diameter","Height")], pch = 16, color = colors , box=FALSE)
legend("right", legend = levels(train.scaled$Sex),
col = c("#999999", "#E69F00", "#56B4E9"), pch = 16)
As we can see from the above scatter plot, the classes are not linearly separable. Let's test this.
######## "Acc 0.522245037645448"
log.scaled.m <- multinom(Sex~Length + Diameter + Height, data = train.scaled)
## # weights: 15 (8 variable)
## initial value 3210.145107
## iter 10 value 2653.212832
## final value 2643.727856
## converged
summary(log.scaled.m)
## Call:
## multinom(formula = Sex ~ Length + Diameter + Height, data = train.scaled)
##
## Coefficients:
## (Intercept) Length Diameter Height
## I -0.2961434 0.9273012 -1.8720556 -0.9096382
## M 0.2356347 0.1949341 -0.1766856 -0.2217037
##
## Std. Errors:
## (Intercept) Length Diameter Height
## I 0.06173442 0.3380388 0.3523215 0.13488018
## M 0.05064883 0.2248627 0.2308985 0.09937892
##
## Residual Deviance: 5287.456
## AIC: 5303.456
#Predicting the values for train dataset
train.predicted_vals <- predict(log.scaled.m, newdata = train.scaled[c("Length","Diameter","Height")], "class")
# Building classification table
paste('Acc', accuracy(train.scaled$Sex, train.predicted_vals))
## [1] "Acc 0.522245037645448"
cf.train <- caret::confusionMatrix(data=train.predicted_vals, reference=train.scaled$Sex)
cf.train
## Confusion Matrix and Statistics
##
## Reference
## Prediction F I M
## F 170 11 173
## I 152 683 223
## M 593 244 673
##
## Overall Statistics
##
## Accuracy : 0.5222
## 95% CI : (0.5039, 0.5405)
## No Information Rate : 0.3658
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2726
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: F Class: I Class: M
## Sensitivity 0.18579 0.7281 0.6296
## Specificity 0.90832 0.8110 0.5483
## Pos Pred Value 0.48023 0.6456 0.4457
## Neg Pred Value 0.70989 0.8632 0.7195
## Prevalence 0.31314 0.3210 0.3658
## Detection Rate 0.05818 0.2337 0.2303
## Detection Prevalence 0.12115 0.3621 0.5168
## Balanced Accuracy 0.54706 0.7696 0.5889
The training set accuracy is 0.522. We can see from the confusion matrix that the model classifies Female data points far less reliably (sensitivity 0.186) than Infant (0.728) and Male (0.630) data points. Let's predict on the test set using our trained model.
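The per-class figures in the training confusion matrix can be reproduced by hand, which makes clear what each one measures. The sketch below retypes the counts from the output above; only base R is used.

```r
# Training confusion matrix reported above (rows = predicted, columns = reference)
cm <- matrix(c(170,  11, 173,
               152, 683, 223,
               593, 244, 673),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("F", "I", "M"),
                             Reference  = c("F", "I", "M")))

accuracy_overall <- sum(diag(cm)) / sum(cm)
sensitivity <- diag(cm) / colSums(cm)   # share of each true class recovered
precision   <- diag(cm) / rowSums(cm)   # share of each predicted class that is right

round(accuracy_overall, 4)
round(sensitivity, 4)
round(precision, 4)
```

Most true females (593 of 915) are predicted as male, which is what drives the low Female sensitivity.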
test.predicted_vals <- predict(log.scaled.m, newdata = test.scaled[c("Length","Diameter","Height")], "class")
# Building classification table
paste('Acc', accuracy(test.scaled$Sex, test.predicted_vals))
## [1] "Acc 0.535571542765787"
cf.test <- caret::confusionMatrix(data=test.predicted_vals, reference=test.scaled$Sex)
cf.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction F I M
## F 87 2 79
## I 61 307 103
## M 243 93 276
##
## Overall Statistics
##
## Accuracy : 0.5356
## 95% CI : (0.5075, 0.5635)
## No Information Rate : 0.3661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2941
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: F Class: I Class: M
## Sensitivity 0.22251 0.7637 0.6026
## Specificity 0.90581 0.8068 0.5763
## Pos Pred Value 0.51786 0.6518 0.4510
## Neg Pred Value 0.71930 0.8782 0.7152
## Prevalence 0.31255 0.3213 0.3661
## Detection Rate 0.06954 0.2454 0.2206
## Detection Prevalence 0.13429 0.3765 0.4892
## Balanced Accuracy 0.56416 0.7853 0.5895
The test set accuracy is 0.536, close to the training accuracy of 0.522, so the model performs about as well on new data as on the data it was fitted to. The confusion matrix for the test set is similar to that of the training set. Let's check out the SVM.
summary(tuned.svm <- tune.svm(Sex~.,data=train.scaled, kernel="radial", gamma = 10^(-1:1), cost = 10^(-1:1)))
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 1 1
##
## - best performance: 0.4787835
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.1 0.1 0.4811807 0.02273328
## 2 1.0 0.1 0.4822081 0.02322048
## 3 10.0 0.1 0.4856188 0.02141920
## 4 0.1 1.0 0.4842618 0.02078784
## 5 1.0 1.0 0.4787835 0.02437308
## 6 10.0 1.0 0.4921210 0.02328719
## 7 0.1 10.0 0.4880254 0.02313988
## 8 1.0 10.0 0.4811866 0.02302472
## 9 10.0 10.0 0.5013523 0.02568625
tuned.svm$best.model
##
## Call:
## best.svm(x = Sex ~ ., data = train.scaled, gamma = 10^(-1:1), cost = 10^(-1:1),
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 2479
train.predicted_vals <- predict(tuned.svm$best.model, newdata=train.scaled[c("Length","Diameter","Height")], decision.values=TRUE)
# Building classification table
paste('Acc', accuracy(train.scaled$Sex, train.predicted_vals))
## [1] "Acc 0.538329911019849"
cf.train <- caret::confusionMatrix(data=train.predicted_vals, reference=train.scaled$Sex)
cf.train
## Confusion Matrix and Statistics
##
## Reference
## Prediction F I M
## F 157 14 106
## I 165 699 246
## M 593 225 717
##
## Overall Statistics
##
## Accuracy : 0.5383
## 95% CI : (0.5201, 0.5565)
## No Information Rate : 0.3658
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2964
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: F Class: I Class: M
## Sensitivity 0.17158 0.7452 0.6707
## Specificity 0.94021 0.7928 0.5586
## Pos Pred Value 0.56679 0.6297 0.4671
## Neg Pred Value 0.71342 0.8681 0.7462
## Prevalence 0.31314 0.3210 0.3658
## Detection Rate 0.05373 0.2392 0.2454
## Detection Prevalence 0.09480 0.3799 0.5253
## Balanced Accuracy 0.55590 0.7690 0.6146
For the SVM, we tune across a grid of hyperparameters (gamma and cost) using 10-fold cross-validation, then predict on the training set with the best model. The results are very similar to those of logistic regression on the training set, with an accuracy of 0.538 and a similar confusion matrix. Let's see the test set.
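The selection step of the grid search can be traced by hand from the cross-validation errors printed in the tuning output. The sketch below retypes those errors into a data frame and picks the minimum; it does not refit any model.

```r
# Cross-validation errors copied from the tuning output above.
# expand.grid varies gamma fastest, matching the row order of that table.
grid <- expand.grid(gamma = 10^(-1:1), cost = 10^(-1:1))
grid$error <- c(0.4811807, 0.4822081, 0.4856188,
                0.4842618, 0.4787835, 0.4921210,
                0.4880254, 0.4811866, 0.5013523)

best <- grid[which.min(grid$error), ]
best
1 - best$error       # cross-validated accuracy of the selected pair
range(grid$error)    # spread of the error across the whole grid
```

The spread of errors across the grid (roughly 0.02) is about the same size as the dispersion reported for each cell, so the choice of gamma = 1 and cost = 1 is only weakly determined by these results.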
test.predicted_vals <- predict(tuned.svm$best.model, newdata=test.scaled[c("Length","Diameter","Height")], decision.values=TRUE)
# Building classification table
paste('Acc', accuracy(test.scaled$Sex, test.predicted_vals))
## [1] "Acc 0.512390087929656"
cf.test <- caret::confusionMatrix(data=test.predicted_vals, reference=test.scaled$Sex)
cf.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction F I M
## F 50 5 66
## I 68 310 111
## M 273 87 281
##
## Overall Statistics
##
## Accuracy : 0.5124
## 95% CI : (0.4843, 0.5404)
## No Information Rate : 0.3661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2573
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: F Class: I Class: M
## Sensitivity 0.12788 0.7711 0.6135
## Specificity 0.91744 0.7892 0.5460
## Pos Pred Value 0.41322 0.6339 0.4384
## Neg Pred Value 0.69823 0.8793 0.7098
## Prevalence 0.31255 0.3213 0.3661
## Detection Rate 0.03997 0.2478 0.2246
## Detection Prevalence 0.09672 0.3909 0.5124
## Balanced Accuracy 0.52266 0.7802 0.5798
The SVM test accuracy is 0.512, and its ability to classify Female data points is similarly poor (sensitivity 0.128, against 0.223 for logistic regression), even though a non-linear radial kernel was specified.
Both logistic regression and the SVM had mediocre performance on unseen data. As such, logistic regression should be used, given its slightly higher test accuracy of 0.536 and highly interpretable coefficients.
As we shift to a binary classification problem, class imbalance may become an issue, so more robust metrics such as the F1 score will be reported. The same models as above will be trained and tested.
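To put the two three-class test accuracies in context, their confidence intervals can be recomputed from the test confusion matrices with an exact binomial test. This is a sketch using only the counts reported above.

```r
n_test <- 1251                       # 391 F + 402 I + 458 M in the test set
correct_logit <- 87 + 307 + 276      # diagonal of the logistic test matrix
correct_svm   <- 50 + 310 + 281      # diagonal of the SVM test matrix

# Exact (Clopper-Pearson) 95% intervals for each accuracy
ci_logit <- binom.test(correct_logit, n_test)$conf.int
ci_svm   <- binom.test(correct_svm,   n_test)$conf.int

round(c(logit = correct_logit / n_test, svm = correct_svm / n_test), 4)
round(rbind(logit = ci_logit, svm = ci_svm), 4)
```

The intervals overlap, so the accuracy gap between the two models is small relative to sampling noise. Because both models were scored on the same test rows, a paired comparison would be needed for a formal test; the overlap is only indicative.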
First we need to get our data.
################## INFANT DATA PREP
inf.train.scaled <- train.scaled
# levels are F, I, M: merge F and M into 'Not Infant'
levels(inf.train.scaled$Sex) <- c('Not Infant', 'I', 'Not Infant')
inf.test.scaled <- test.scaled
levels(inf.test.scaled$Sex) <- c('Not Infant', 'I', 'Not Infant')
Let's see the distribution for infants.
################## INFANT LOGISTIC TRAIN
colors <- c("#E69F00", "#56B4E9")
colors <- colors[as.numeric(inf.train.scaled$Sex)]
s3d <- scatterplot3d(inf.train.scaled[c("Length","Diameter","Height")], pch = 16, color = colors , box=FALSE)
legend("right", legend = levels(inf.train.scaled$Sex),
col = c("#E69F00", "#56B4E9"), pch = 16)
Compared with the three-class plot, we can see that the data is much more linearly separable.
#logistic regression
log.scaled.m <- glm(Sex~ Length + Diameter + Height, data=inf.train.scaled, family = 'binomial')
summary(log.scaled.m)
##
## Call:
## glm(formula = Sex ~ Length + Diameter + Height, family = "binomial",
## data = inf.train.scaled)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.11665 0.05453 -20.478 < 2e-16 ***
## Length 0.82011 0.31397 2.612 0.009 **
## Diameter -1.77512 0.32805 -5.411 6.26e-08 ***
## Height -0.78242 0.12210 -6.408 1.47e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3667.9 on 2921 degrees of freedom
## Residual deviance: 2563.6 on 2918 degrees of freedom
## AIC: 2571.6
##
## Number of Fisher Scoring iterations: 5
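The coefficients in this summary are on the log-odds scale for the "I" class, since "Not Infant" is the first factor level. The sketch below converts the reported values into a baseline probability and odds ratios, and shows what a cut-off of 0.5 means on each scale. The numbers are retyped from the output above; nothing is refitted.

```r
# Coefficients copied from the summary above (predictors are standardised)
beta <- c(Intercept = -1.11665, Length = 0.82011,
          Diameter = -1.77512, Height = -0.78242)

# Probability of "I" for an abalone at the training mean of every predictor
p_at_mean <- plogis(beta[["Intercept"]])

# Odds ratios for a one standard deviation increase in each scaled predictor
odds_ratios <- exp(beta[-1])

# A cut-off of 0.5 on the log-odds scale is a probability cut-off of about 0.62
p_cutoff <- plogis(0.5)

round(p_at_mean, 3)
round(odds_ratios, 3)
round(p_cutoff, 3)
```

`predict()` on a `glm` returns log-odds unless `type = "response"` is given, so the scale of the predictions matters when choosing the classification threshold.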
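# Predicting the values for the train dataset
# Note: predict.glm() returns log-odds (type = "link") by default, so the 0.5
# cut-off below is on the log-odds scale, i.e. a probability of about 0.62.
# Use type = "response" with the same cut-off to threshold at probability 0.5.
train.predicted_vals <- predict(log.scaled.m, newdata = inf.train.scaled[c("Length","Diameter","Height")], type = "link")
train.predict_class <- ifelse(train.predicted_vals>0.5, 'I', 'Not Infant')
# Accuracy
paste('Acc', accuracy(inf.train.scaled$Sex, train.predict_class))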
## [1] "Acc 0.783367556468172"
# F1 score; with no `positive` argument the value below is for the 'Not Infant' class
f1_val <- MLmetrics::F1_Score(inf.train.scaled$Sex, train.predict_class)
f1_val
## [1] 0.8544493
# Building classification table
table(inf.train.scaled$Sex, train.predict_class)
## train.predict_class
## I Not Infant
## Not Infant 126 1858
## I 431 507
The initial training results show a relatively good accuracy of 0.783 and an F1 score of 0.854 (computed for the 'Not Infant' class). From the table we can see that, although non-infants are classified correctly at a high rate (1858 of 1984), more than half of the true infants (507 of 938) are misclassified as non-infants, i.e. recall for the infant class is low.
Let's examine the test set.
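Because the reported F1 score depends on which class is treated as positive, it helps to compute precision, recall and F1 for both classes from the training table. The sketch below retypes that table; `class_metrics` is an illustrative helper, not a library function.

```r
# Training table reported above (rows = actual, columns = predicted)
tab <- matrix(c(126, 1858,
                431,  507),
              nrow = 2, byrow = TRUE,
              dimnames = list(Actual    = c("Not Infant", "I"),
                              Predicted = c("I", "Not Infant")))

class_metrics <- function(tab, cls) {
  tp <- tab[cls, cls]
  precision <- tp / sum(tab[, cls])   # correct among those predicted as cls
  recall    <- tp / sum(tab[cls, ])   # correct among those actually cls
  c(precision = precision, recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}

metrics <- rbind(`Not Infant` = class_metrics(tab, "Not Infant"),
                 I            = class_metrics(tab, "I"))
round(metrics, 4)
```

The 'Not Infant' row reproduces the reported F1 of about 0.854, while the infant class has a much lower recall and F1. For the sustainability question, the infant row is the more relevant one.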
# Predicting the values for the test dataset
# As for the training set, the 0.5 cut-off is applied on the log-odds scale
test.predicted_vals <- predict(log.scaled.m, newdata = inf.test.scaled[c("Length","Diameter","Height")], type = "link")
test.predict_class <- ifelse(test.predicted_vals>0.5, 'I', 'Not Infant')
paste('Acc', accuracy(inf.test.scaled$Sex, test.predict_class))
## [1] "Acc 0.778577138289368"
f1_val <- MLmetrics::F1_Score(inf.test.scaled$Sex, test.predict_class)
f1_val
## [1] 0.8521089
# Building classification table
table(inf.test.scaled$Sex, test.predict_class)
## test.predict_class
## I Not Infant
## Not Infant 51 798
## I 176 226
The results are comparable to the training results and only slightly lower, as expected when predicting on unseen data. Let's examine the SVM.
# train svm
tuned.svm <- tune.svm(Sex~.,data=inf.train.scaled, kernel="radial", gamma = 10^(-1:1), cost = 10^(-1:1))
tuned.svm$best.model
##
## Call:
## best.svm(x = Sex ~ ., data = inf.train.scaled, gamma = 10^(-1:1),
## cost = 10^(-1:1), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 1335
train.predicted_vals <- predict(tuned.svm$best.model, newdata=inf.train.scaled[c("Length","Diameter","Height")], decision.values=TRUE)
# Metrics
cf.train <- caret::confusionMatrix(data=train.predicted_vals, reference=inf.train.scaled$Sex ,mode = "everything")
cf.train
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Infant I
## Not Infant 1785 400
## I 199 538
##
## Accuracy : 0.795
## 95% CI : (0.7799, 0.8095)
## No Information Rate : 0.679
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5016
##
## Mcnemar's Test P-Value : 3.039e-16
##
## Sensitivity : 0.8997
## Specificity : 0.5736
## Pos Pred Value : 0.8169
## Neg Pred Value : 0.7300
## Precision : 0.8169
## Recall : 0.8997
## F1 : 0.8563
## Prevalence : 0.6790
## Detection Rate : 0.6109
## Detection Prevalence : 0.7478
## Balanced Accuracy : 0.7366
##
## 'Positive' Class : Not Infant
##
The F1 score and accuracy are similar to the logistic regression training scores; however, the confusion matrix shows that the SVM classifies infants better (538 of 938 correct, against 431 of 938). Let's see the test set.
test.predicted_vals <- predict(tuned.svm$best.model, newdata=inf.test.scaled[c("Length","Diameter","Height")], decision.values=TRUE)
# Metrics
cf.test <- caret::confusionMatrix(data=test.predicted_vals, reference=inf.test.scaled$Sex, mode = "everything")
cf.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Infant I
## Not Infant 762 177
## I 87 225
##
## Accuracy : 0.789
## 95% CI : (0.7653, 0.8113)
## No Information Rate : 0.6787
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4859
##
## Mcnemar's Test P-Value : 4.312e-08
##
## Sensitivity : 0.8975
## Specificity : 0.5597
## Pos Pred Value : 0.8115
## Neg Pred Value : 0.7212
## Precision : 0.8115
## Recall : 0.8975
## F1 : 0.8523
## Prevalence : 0.6787
## Detection Rate : 0.6091
## Detection Prevalence : 0.7506
## Balanced Accuracy : 0.7286
##
## 'Positive' Class : Not Infant
##
Again, the SVM is better at classifying infants, correctly identifying 225 of the 402 test infants compared with 176 for logistic regression.
Both models performed better than in the first scenario because the data set is more 'separable'. The support vector machine should be chosen as the go-to model for predicting which abalone are infants, to avoid harvesting them.
Female abalone detection can be highly profitable, so let's examine how reliably we can predict them.
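The trade-off between the two infant classifiers on the test set can be summarised from the tables above. The sketch below retypes the relevant counts and treats "I" as the class of interest, since missed infants are the costly error from a sustainability perspective.

```r
n_infant_test <- 402                  # true infants in the test set

# Infants correctly identified, and all abalone predicted as infants
hits      <- c(logistic = 176, svm = 225)
predicted <- c(logistic = 176 + 51, svm = 225 + 87)

infant_recall    <- hits / n_infant_test
infant_precision <- hits / predicted
missed_infants   <- n_infant_test - hits   # infants labelled 'Not Infant'

round(rbind(infant_recall, infant_precision), 4)
missed_infants
```

The SVM misses fewer infants at the cost of somewhat lower precision, meaning more non-infants are wrongly flagged as infants. Which error matters more depends on whether sustainability or yield is prioritised.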