Generally speaking, the scripts I have seen so far deal with automatically cropping and adjusting the images for processing. I thought instead of defining some geometric metrics for each species: if someone can mark the suggested points on each fish, these theoretical metrics become a viable input for the machine learning model.
To mark the points I used GeoGebra, which is free. Basically, we create points A, B, C, D, E, F, G, H and I and build measures from them that do not depend on the scale of the image: angles and ratios between segments. An angle has the advantage that its value does not depend on the size of the object. The same holds for a ratio, in which the units cancel, leaving a dimensionless relation between the two segments.
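As a short check of the scale-invariance argument, suppose every length in the image is multiplied by the same factor \(k > 0\) (for example, the same fish photographed from closer). The segments AB and AC are used here only as an illustration:

\[
\frac{k\,|AB|}{k\,|AC|} = \frac{|AB|}{|AC|},
\qquad
\cos \measuredangle BAC
= \frac{(k\,\vec{AB}) \cdot (k\,\vec{AC})}{k\,|AB| \; k\,|AC|}
= \frac{\vec{AB} \cdot \vec{AC}}{|AB|\,|AC|}
\]

The factor \(k\) cancels in both expressions, so neither the ratio nor the angle changes with the size of the fish in the picture.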
Point A is marked on the eye, B on the mouth, C on the tail end, F on the highest point and G on the lowest point. From these we draw the segments AB and AC to calculate the angle BAC. BC is used to draw parallel lines through G and F, which in turn give the perpendicular line IH. The lines CD and CE are used to calculate the angle DCE.
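A minimal sketch of how an angle such as BAC and a segment ratio could be computed from marked coordinates. The pixel coordinates, the helper names `angle_at` and `length`, and the ratio AC/AB are all hypothetical; the ratio is only an illustration and is not claimed to be one of r1 to r4. Some BAC values in the table exceed \(\pi\), which an unsigned angle cannot do, so the `oriented` option (counter-clockwise angle in \([0, 2\pi)\)) is offered as one reading that would fit; that is an assumption.

```python
import math

def angle_at(vertex, p, q, oriented=False):
    """Angle in radians at `vertex` between the rays vertex->p and vertex->q.

    oriented=False: unsigned angle in [0, pi].
    oriented=True:  counter-clockwise angle from p to q in [0, 2*pi).
    """
    ux, uy = p[0] - vertex[0], p[1] - vertex[1]
    vx, vy = q[0] - vertex[0], q[1] - vertex[1]
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    theta = math.atan2(cross, dot)
    return theta % (2 * math.pi) if oriented else abs(theta)

def length(p, q):
    """Euclidean length of the segment pq."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

# Hypothetical pixel coordinates, not measured from any real image.
A = (120.0, 80.0)   # eye
B = (60.0, 95.0)    # mouth
C = (520.0, 90.0)   # tail end

bac = angle_at(A, B, C)
ratio = length(A, C) / length(A, B)  # illustrative ratio only

# Scaling every coordinate by the same factor leaves both values unchanged.
k = 2.5
A2, B2, C2 = [(k * x, k * y) for x, y in (A, B, C)]
bac_scaled = angle_at(A2, B2, C2)
ratio_scaled = length(A2, C2) / length(A2, B2)

print(round(bac, 4), round(ratio, 4))
```

Using `atan2` of the cross and dot products avoids the loss of precision that `acos` has for angles close to 0 or \(\pi\), which matters here because BAC is close to a straight angle.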
Being theoretical, the model naturally looks like a fair way to separate the types of fish. However, when the measurements are collected from real images there will be small differences, because of where automatic recognition places the centroids of the key points.
This gives us a random variable for each proposed measure, with an unknown distribution (since we do not yet have actual data from which to infer it). I believe the next step is to estimate a kernel density curve for each measure and test whether the resulting distributions differ in a statistically significant way. If they do, these measures should be a good basis for a classification model.
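To illustrate how key-point error turns a measure into a random variable, here is a small simulation sketch. Everything in it is assumed for illustration: the pixel coordinates, Gaussian noise with a standard deviation of 2 pixels on each key point, and the helper names `jitter` and `gaussian_kde`. The bandwidth uses the normal-reference rule of thumb, which is one common default and not necessarily the one used for the results in this post.

```python
import math
import random
import statistics

def angle_at(vertex, p, q):
    """Unsigned angle in radians at `vertex` between vertex->p and vertex->q."""
    ux, uy = p[0] - vertex[0], p[1] - vertex[1]
    vx, vy = q[0] - vertex[0], q[1] - vertex[1]
    return abs(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))

def jitter(point, sd, rng):
    """Add independent Gaussian noise to a key point (assumed error model)."""
    return (point[0] + rng.gauss(0.0, sd), point[1] + rng.gauss(0.0, sd))

def gaussian_kde(samples, bandwidth=None):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb.
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# Hypothetical key points (pixels) and noise level.
A, B, C = (120.0, 80.0), (60.0, 95.0), (520.0, 90.0)
rng = random.Random(0)
samples = [angle_at(jitter(A, 2.0, rng), jitter(B, 2.0, rng), jitter(C, 2.0, rng))
           for _ in range(500)]

density = gaussian_kde(samples)
exact = angle_at(A, B, C)
print(round(exact, 4), round(statistics.mean(samples), 4))
```

The function `density` can then be evaluated on a grid to draw the kernel curve for the simulated angle, and the same recipe applies to each ratio.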
The proposed metrics are:
| Species | BAC | DCE | var.a | r1 | r2 | r3 | r4 | var.r |
|---|---|---|---|---|---|---|---|---|
| ALB | 2.910861 | 1.923703 | 0.4872411 | 3.281250 | 1.166667 | 6.774193 | 2.100000 | 6.019252 |
| BET | 3.235493 | 1.830677 | 0.9867542 | 3.433333 | 1.197674 | 6.058823 | 2.288889 | 4.352921 |
| DOL | 2.985736 | 1.402547 | 1.2532435 | 3.366667 | 1.174419 | 6.733333 | 2.200357 | 5.833552 |
| LAG | 2.697582 | 1.387712 | 0.8578799 | 1.533333 | 1.200000 | 5.520000 | 1.440000 | 4.281644 |
| SHARK | 3.266036 | 1.320342 | 1.8928623 | 5.894737 | 1.166667 | 7.000000 | 1.178947 | 9.477292 |
| YFT | 3.064101 | 1.430473 | 1.3343717 | 3.690476 | 1.115108 | 9.687500 | 1.937500 | 14.991206 |
Let us test whether the kernel density curves differ, first between the angles \(\measuredangle BAC\) and \(\measuredangle DCE\), and then among the measures \(r_1\), \(r_2\), \(r_3\) and \(r_4\). The kernel density is computed and the two-sided Kolmogorov-Smirnov test is applied under the null hypothesis \(H_0\): the data follow the same distribution. var.a and var.r were left out of this comparison because they are derived from the proposed measures. They were, however, included when comparing species.
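For readers who want to see what the two-sided, two-sample Kolmogorov-Smirnov test computes, here is a self-contained sketch using only the standard library. The function name `ks_two_sample` and the toy samples are hypothetical, and the p-value uses the asymptotic Kolmogorov series, which is only a rough approximation for small samples. This is not necessarily how the p-values in this post were obtained.

```python
import math

def ks_two_sample(x, y):
    """Two-sided two-sample KS test.

    Returns (D, p) where D is the largest gap between the two empirical
    distribution functions and p is the asymptotic p-value under H0.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk through the pooled values in order, tracking both ECDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The survival function is 1 to many decimals in this range.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

# Toy data: identical samples, then two samples that do not overlap at all.
d_same, p_same = ks_two_sample([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
d_apart, p_apart = ks_two_sample([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(d_same, p_same)
print(d_apart, p_apart < 0.05)
```

In practice a library routine such as `scipy.stats.ks_2samp` is the more convenient choice. Note that the p-value shrinks quickly as the number of points grows, so very small p-values are expected when long vectors (for example, a density curve evaluated on a fine grid) are compared.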
The theoretical result is interesting because it indicates that the two angles are statistically different from each other, and so are the ratios.
Comparing species also brings good news. Only BET and LAG (\(e_{24}\)) show a remarkable similarity under the proposed metrics. In other words, there is evidence that the method really can be a way of classifying species.
To read the tables, number the species in the order in which they appear in the table “Theoretical metrics - All Species”. For example, \(e_{12}\) compares ALB and BET.
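The labelling convention can be spelled out with a few lines of code. This sketch only assumes the species order of the table (ALB, BET, DOL, LAG, SHARK, YFT); the dictionary name `pairs` is hypothetical.

```python
from itertools import combinations

# Species in the order of the "Theoretical metrics - All Species" table.
species = ["ALB", "BET", "DOL", "LAG", "SHARK", "YFT"]

# Map each label e_ij (i < j) to the pair of species it compares.
pairs = {
    f"e{i}{j}": (species[i - 1], species[j - 1])
    for i, j in combinations(range(1, len(species) + 1), 2)
}

for label, (first, second) in pairs.items():
    print(label, first, second)
```

Six species give 15 pairwise comparisons, which matches the 15 rows of each species table.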
| Tested | P-value |
|---|---|
| a12 | 0.0e+00 |
| r12 | 0.0e+00 |
| r13 | 0.0e+00 |
| r14 | 0.0e+00 |
| r23 | 0.0e+00 |
| r24 | 0.0e+00 |
| r34 | 1.6e-06 |
Comparison between species (all metrics, including var.a and var.r):
| Tested | P-value |
|---|---|
| e12 | 0.000000 |
| e13 | 0.000002 |
| e14 | 0.000007 |
| e15 | 0.005020 |
| e16 | 0.019884 |
| e23 | 0.000455 |
| e24 | 0.368245 |
| e25 | 0.000000 |
| e26 | 0.000000 |
| e34 | 0.003922 |
| e35 | 0.000000 |
| e36 | 0.000000 |
| e45 | 0.000000 |
| e46 | 0.000000 |
| e56 | 0.016006 |
The interesting point is that if we remove the var.a and var.r columns, BET and LAG, which were previously similar, become easier to tell apart according to the measurements.
So in theory it is enough to use the proposed measures in the learning model and then, specifically for BET and LAG, to recalculate without “var.a” and “var.r” in order to differentiate the two species.
The dissimilarity matrix indicates a reasonable difference between species under the suggested measures, and the dendrogram clearly separates the groups involved.
Comparison between species without var.a and var.r:
| Tested | P-value |
|---|---|
| e12 | 0.967440 |
| e13 | 0.006393 |
| e14 | 0.054592 |
| e15 | 0.000000 |
| e16 | 0.000003 |
| e23 | 0.000606 |
| e24 | 0.078958 |
| e25 | 0.000000 |
| e26 | 0.000000 |
| e34 | 0.000606 |
| e35 | 0.000000 |
| e36 | 0.000053 |
| e45 | 0.000000 |
| e46 | 0.000002 |
| e56 | 0.000252 |
Dissimilarities:
|  | ALB | BET | DOL | LAG | SHARK |
|---|---|---|---|---|---|
| BET | 0.3309093 | | | | |
| DOL | 0.3076542 | 0.3341802 | | | |
| LAG | 0.4864959 | 0.4595448 | 0.4061729 | | |
| SHARK | 0.6034372 | 0.5753474 | 0.4533110 | 0.5906772 | |
| YFT | 0.5637891 | 0.6027002 | 0.4239479 | 0.6663829 | 0.5434425 |
Metric: mixed; types = N, I, I, I, I, I, I, I, I. Number of objects: 6.
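The dissimilarities printed above appear consistent with a Gower-type coefficient in which each numeric column is divided by its range and the species label itself counts as one nominal variable that differs for every pair (the `N` in the types line). That reading is an inference from the output, not something stated in the text. The sketch below recomputes the matrix from the "All Species" table under that assumption; the function name `gower_with_label` is hypothetical.

```python
from itertools import combinations

# Rows of the "Theoretical metrics - All Species" table:
# BAC, DCE, var.a, r1, r2, r3, r4, var.r
metrics = {
    "ALB":   [2.910861, 1.923703, 0.4872411, 3.281250, 1.166667, 6.774193, 2.100000, 6.019252],
    "BET":   [3.235493, 1.830677, 0.9867542, 3.433333, 1.197674, 6.058823, 2.288889, 4.352921],
    "DOL":   [2.985736, 1.402547, 1.2532435, 3.366667, 1.174419, 6.733333, 2.200357, 5.833552],
    "LAG":   [2.697582, 1.387712, 0.8578799, 1.533333, 1.200000, 5.520000, 1.440000, 4.281644],
    "SHARK": [3.266036, 1.320342, 1.8928623, 5.894737, 1.166667, 7.000000, 1.178947, 9.477292],
    "YFT":   [3.064101, 1.430473, 1.3343717, 3.690476, 1.115108, 9.687500, 1.937500, 14.991206],
}

def gower_with_label(data):
    """Gower-type dissimilarity for every pair of rows.

    Numeric columns contribute |x - y| / range; the species label is
    treated as one extra nominal variable that always differs (assumption).
    """
    columns = list(zip(*data.values()))
    ranges = [max(col) - min(col) for col in columns]
    n_vars = len(ranges) + 1  # +1 for the nominal species label
    result = {}
    for a, b in combinations(data, 2):
        numeric = sum(abs(x - y) / r for x, y, r in zip(data[a], data[b], ranges))
        result[(a, b)] = (1.0 + numeric) / n_vars
    return result

dissimilarity = gower_with_label(metrics)
closest = min(dissimilarity, key=dissimilarity.get)
for pair, value in dissimilarity.items():
    print(pair, round(value, 7))
```

Under this assumption the recomputed values agree with the printed matrix to about four decimals, and the closest pair is ALB and DOL, which is where an agglomerative dendrogram would make its first merge. Because the species label adds a constant 1/9 to every pair, it shifts all dissimilarities equally and does not change their ordering.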