What to expect from this script?

Generally speaking, the scripts I have seen so far focus on automatically cropping and adjusting the images for processing. I took a different approach: I defined some geometric metrics for each species. If someone can mark the suggested points on the images, these theoretical metrics become usable as features for a machine learning model.

To mark the points I used GeoGebra, which is free. We need to create points A, B, C, D, E, F, G, H and I, and from them build measures that do not depend on the scale of the image: angles and ratios between segments. An angle has the advantage that its unit does not depend on the size of the object. The same holds for a ratio, in which the units cancel, leaving a dimensionless measure of how two segments relate.

Understanding the markings.

Point A is marked on the eye, B on the mouth, C on the end of the tail, F on the highest point and G on the lowest point. From these we create the segments AB and AC to calculate the angle BAC. The segment BC is used to draw parallel lines passing through F and G, and the segment HI is perpendicular to those lines. The lines CD and CE are used to calculate the angle DCE.

Because the model is theoretical, it naturally looks like a fair way to separate the types of fish. However, when the measurements are collected from real images, there will be small differences, because the key points will come from the centroids found by automatic recognition.

This gives us a random variable for each proposed measure, with an unknown distribution (we do not yet have actual data from which to infer it). I believe the next step is to test, using a kernel density curve, whether the resulting distributions are statistically different. If they are, these measures will be a good basis for a classification model.

Let’s look at more details without the background image.

The proposed metrics are:

  1. Angles in radians: \(\measuredangle BAC\), \(\measuredangle DCE\).
  2. Ratios: \(r_1 = \frac{BC}{HI}\), \(r_2 = \frac{BC}{AC}\), \(r_3 = \frac{BC}{AB}\), \(r_4 = \frac{BC \cdot CE}{CD}\).
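As a minimal sketch of how these metrics could be computed from marked coordinates, assuming each point is an (x, y) pair: the function names and the example coordinates are hypothetical, not taken from the original script. Two BAC values in the tables of this write-up are larger than π, so an interior angle cannot be what was measured; the sketch therefore assumes a counterclockwise angle in [0, 2π). HI is taken as the distance between the two lines parallel to BC that pass through F and G, and \(r_4\) is computed exactly as written above.

```python
import math

def angle_ccw(p, vertex, q):
    """Counterclockwise angle from ray vertex->p to ray vertex->q, in [0, 2*pi).
    Assumption: this orientation convention is not stated in the source."""
    a1 = math.atan2(p[1] - vertex[1], p[0] - vertex[0])
    a2 = math.atan2(q[1] - vertex[1], q[0] - vertex[0])
    return (a2 - a1) % (2 * math.pi)

def dist(p, q):
    """Euclidean length of the segment pq."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def parallel_gap(b, c, f, g):
    """Distance between the lines through f and g that are parallel to BC
    (the length of the perpendicular segment HI)."""
    ux, uy = c[0] - b[0], c[1] - b[1]
    cross = ux * (g[1] - f[1]) - uy * (g[0] - f[0])
    return abs(cross) / math.hypot(ux, uy)

def metrics(pts):
    """Compute the two angles and four ratios from a dict of marked points."""
    A, B, C, D, E, F, G = (pts[k] for k in "ABCDEFG")
    bc = dist(B, C)
    return {
        "BAC": angle_ccw(B, A, C),
        "DCE": angle_ccw(D, C, E),
        "r1": bc / parallel_gap(B, C, F, G),
        "r2": bc / dist(A, C),
        "r3": bc / dist(A, B),
        "r4": bc * dist(C, E) / dist(C, D),
    }

# Hypothetical coordinates, for illustration only (not a real fish outline).
example_points = {
    "A": (1.0, 0.5), "B": (0.0, 0.0), "C": (10.0, 0.0),
    "D": (9.0, 1.0), "E": (9.0, -1.0),
    "F": (5.0, 2.0), "G": (5.0, -1.0),
}

if __name__ == "__main__":
    for name, value in metrics(example_points).items():
        print(f"{name}: {value:.6f}")
```

Image coordinates usually have the y axis pointing down, which reverses the sense of rotation, so the orientation would need to be checked against the marking tool before comparing angles.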

Let’s calculate them for the theoretical models.

Theoretical metrics
Species  BAC       DCE       var.a      r1       r2        r3        r4   var.r
ALB      2.910861  1.923703  0.4872411  3.28125  1.166667  6.774193  2.1  6.019252

Calculating for all proposed species.

Theoretical metrics - All Species
Species  BAC       DCE       var.a      r1        r2        r3        r4        var.r
ALB      2.910861  1.923703  0.4872411  3.281250  1.166667  6.774193  2.100000   6.019252
BET      3.235493  1.830677  0.9867542  3.433333  1.197674  6.058823  2.288889   4.352921
DOL      2.985736  1.402547  1.2532435  3.366667  1.174419  6.733333  2.200357   5.833552
LAG      2.697582  1.387712  0.8578799  1.533333  1.200000  5.520000  1.440000   4.281644
SHARK    3.266036  1.320342  1.8928623  5.894737  1.166667  7.000000  1.178947   9.477292
YFT      3.064101  1.430473  1.3343717  3.690476  1.115108  9.687500  1.937500  14.991206
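The var.a and var.r columns are not defined explicitly in the text. Their values are consistent with the sample variance (n − 1 denominator) of the two angles and of the four ratios in each row. The short check below, with the numbers copied from the table, recomputes both columns under that assumption; the variable names are illustrative.

```python
from statistics import variance  # sample variance, n - 1 denominator

# Rows copied from the "Theoretical metrics - All Species" table:
# species -> (BAC, DCE, var.a, r1, r2, r3, r4, var.r)
rows = {
    "ALB":   (2.910861, 1.923703, 0.4872411, 3.281250, 1.166667, 6.774193, 2.100000, 6.019252),
    "BET":   (3.235493, 1.830677, 0.9867542, 3.433333, 1.197674, 6.058823, 2.288889, 4.352921),
    "DOL":   (2.985736, 1.402547, 1.2532435, 3.366667, 1.174419, 6.733333, 2.200357, 5.833552),
    "LAG":   (2.697582, 1.387712, 0.8578799, 1.533333, 1.200000, 5.520000, 1.440000, 4.281644),
    "SHARK": (3.266036, 1.320342, 1.8928623, 5.894737, 1.166667, 7.000000, 1.178947, 9.477292),
    "YFT":   (3.064101, 1.430473, 1.3343717, 3.690476, 1.115108, 9.687500, 1.937500, 14.991206),
}

def recompute(row):
    """Return (var.a, var.r) recomputed as sample variances of the row."""
    bac, dce, _, r1, r2, r3, r4, _ = row
    return variance([bac, dce]), variance([r1, r2, r3, r4])

if __name__ == "__main__":
    for name, row in rows.items():
        var_a, var_r = recompute(row)
        print(f"{name:5s} var.a={var_a:.7f} (table {row[2]})  "
              f"var.r={var_r:.6f} (table {row[7]})")
```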

Are the measures different from each other?

Let us test whether the kernel density curves of the angles \(\measuredangle BAC\) and \(\measuredangle DCE\) differ, and then do the same among the measures \(r_1\), \(r_2\), \(r_3\) and \(r_4\). The kernel density is calculated and the two-sided Kolmogorov-Smirnov test is applied under the null hypothesis \(H_0\): the data follow the same distribution. The var.a and var.r columns were not considered in this comparison, because they are derived from the proposed measures. They were, however, included when differentiating between species.
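To make the test concrete, here is a minimal standard-library sketch of a two-sample Kolmogorov-Smirnov comparison of the BAC and DCE columns. It is not the original computation: the source does not show how the kernel density and the test were combined, so the sketch works directly on the six raw values per angle and obtains an exact p-value by enumerating every way of splitting the pooled values into two groups. Its p-value is therefore not expected to match the tables in this write-up.

```python
from itertools import combinations

# Angle columns copied from the "Theoretical metrics - All Species" table.
BAC = [2.910861, 3.235493, 2.985736, 2.697582, 3.266036, 3.064101]
DCE = [1.923703, 1.830677, 1.402547, 1.387712, 1.320342, 1.430473]

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    d = 0.0
    for v in list(x) + list(y):
        fx = sum(1 for a in x if a <= v) / len(x)
        fy = sum(1 for b in y if b <= v) / len(y)
        d = max(d, abs(fx - fy))
    return d

def ks_permutation_test(x, y):
    """Return (D, p), where p is the share of all splits of the pooled data
    whose statistic is at least as large as the observed one (two-sided)."""
    pooled = list(x) + list(y)
    observed = ks_statistic(x, y)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if ks_statistic(gx, gy) >= observed - 1e-12:
            hits += 1
    return observed, hits / total

if __name__ == "__main__":
    d, p = ks_permutation_test(BAC, DCE)
    # Every BAC value is larger than every DCE value, so D is 1.
    print(f"D = {d:.3f}, exact permutation p-value = {p:.6f}")
```

With only six values per group the enumeration covers 924 splits, so it runs instantly; for larger samples an asymptotic p-value would be the usual choice.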

The theoretical result is interesting because it indicates that the two angles are statistically different from each other, and that the ratios are also different from one another.

When comparing species we also have good news. Only BET and LAG (\(e_{24}\)) show a remarkable similarity under the proposed metrics. That is, there is evidence that the method really can be a way of classifying species.

To read the tables, number the species in the order in which they appear in the table “Theoretical metrics - All Species”; for example, \(e_{12}\) compares ALB and BET.

P-values - Kolmogorov-Smirnov two-sided test between metrics
Tested P-value
a12 0.0e+00
r12 0.0e+00
r13 0.0e+00
r14 0.0e+00
r23 0.0e+00
r24 0.0e+00
r34 1.6e-06

P-values - Kolmogorov-Smirnov two-sided test between species
Tested P-value
e12 0.000000
e13 0.000002
e14 0.000007
e15 0.005020
e16 0.019884
e23 0.000455
e24 0.368245
e25 0.000000
e26 0.000000
e34 0.003922
e35 0.000000
e36 0.000000
e45 0.000000
e46 0.000000
e56 0.016006
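The labels in the species table can be decoded mechanically. The sketch below maps each label to its species pair and lists the pairs whose p-value is above a significance level. The 0.05 level is an assumption, since the text does not state one, and the names are illustrative.

```python
from itertools import combinations

# Species in the order of the "Theoretical metrics - All Species" table.
species = ["ALB", "BET", "DOL", "LAG", "SHARK", "YFT"]

# P-values copied from the between-species table above.
p_values = {
    "e12": 0.000000, "e13": 0.000002, "e14": 0.000007, "e15": 0.005020,
    "e16": 0.019884, "e23": 0.000455, "e24": 0.368245, "e25": 0.000000,
    "e26": 0.000000, "e34": 0.003922, "e35": 0.000000, "e36": 0.000000,
    "e45": 0.000000, "e46": 0.000000, "e56": 0.016006,
}

# "e" + 1-based indices of the two species, e.g. e24 -> (BET, LAG).
labels = {
    f"e{i}{j}": (species[i - 1], species[j - 1])
    for i, j in combinations(range(1, len(species) + 1), 2)
}

ALPHA = 0.05  # assumed significance level, not stated in the source

# Pairs for which the hypothesis of equal distributions is not rejected.
not_rejected = [labels[key] for key, p in p_values.items() if p > ALPHA]

if __name__ == "__main__":
    for a, b in not_rejected:
        print(f"{a} and {b} are not separated at alpha = {ALPHA}")
```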

And if we remove var.a and var.r, what happens?

The interesting point is that if we remove the var.a and var.r columns, the p-value for BET and LAG (\(e_{24}\)) drops from 0.368245 to 0.078958, so these two species, which were previously similar, become easier to tell apart according to the measurements.

So in theory, it is enough to use the proposed measures in the learning model and then, specifically for BET and LAG, to recalculate without “var.a” and “var.r” in order to differentiate the species.

The dissimilarity matrix indicates a reasonable difference between species using the suggested measures, and the dendrogram clearly segregates the groups involved.

P-values - Kolmogorov-Smirnov two-sided test between species (excluding var.a and var.r)
Tested P-value
e12 0.967440
e13 0.006393
e14 0.054592
e15 0.000000
e16 0.000003
e23 0.000606
e24 0.078958
e25 0.000000
e26 0.000000
e34 0.000606
e35 0.000000
e36 0.000053
e45 0.000000
e46 0.000002
e56 0.000252

## Dissimilarities :
##             ALB       BET       DOL       LAG     SHARK
## BET   0.3309093                                        
## DOL   0.3076542 0.3341802                              
## LAG   0.4864959 0.4595448 0.4061729                    
## SHARK 0.6034372 0.5753474 0.4533110 0.5906772          
## YFT   0.5637891 0.6027002 0.4239479 0.6663829 0.5434425
## 
## Metric :  mixed ;  Types = N, I, I, I, I, I, I, I, I 
## Number of objects : 6
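The “mixed” metric with one nominal (N) and eight interval (I) variables is consistent with Gower's dissimilarity, which is what R's `cluster::daisy` reports for mixed data. The call itself is not shown in the source, so that is an assumption. Under it, each interval variable contributes the absolute difference divided by the range of its column, the nominal species label contributes 1 for any two different species, and the nine contributions are averaged. The sketch below recomputes the matrix from the table values on that basis.

```python
species = ["ALB", "BET", "DOL", "LAG", "SHARK", "YFT"]

# Columns: BAC, DCE, var.a, r1, r2, r3, r4, var.r
# (copied from the "Theoretical metrics - All Species" table).
table = [
    [2.910861, 1.923703, 0.4872411, 3.281250, 1.166667, 6.774193, 2.100000, 6.019252],
    [3.235493, 1.830677, 0.9867542, 3.433333, 1.197674, 6.058823, 2.288889, 4.352921],
    [2.985736, 1.402547, 1.2532435, 3.366667, 1.174419, 6.733333, 2.200357, 5.833552],
    [2.697582, 1.387712, 0.8578799, 1.533333, 1.200000, 5.520000, 1.440000, 4.281644],
    [3.266036, 1.320342, 1.8928623, 5.894737, 1.166667, 7.000000, 1.178947, 9.477292],
    [3.064101, 1.430473, 1.3343717, 3.690476, 1.115108, 9.687500, 1.937500, 14.991206],
]

# Range of each interval column, used to scale differences into [0, 1].
ranges = [max(col) - min(col) for col in zip(*table)]

def gower(i, j):
    """Gower dissimilarity between rows i and j: eight range-scaled numeric
    columns plus the species label as one nominal variable (assumption)."""
    numeric = sum(abs(a - b) / r for a, b, r in zip(table[i], table[j], ranges))
    nominal = 0.0 if species[i] == species[j] else 1.0
    return (numeric + nominal) / (len(ranges) + 1)

if __name__ == "__main__":
    # Print the lower triangle in the same layout as the output above.
    for i in range(1, len(species)):
        cells = " ".join(f"{gower(i, j):.7f}" for j in range(i))
        print(f"{species[i]:5s} {cells}")
```

Because the species label is included as a variable, every off-diagonal entry carries a constant 1/9 contribution; dropping it would shift all dissimilarities without changing their ordering.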