Purpose: Shows the performance of the model in a normalized form, where values represent proportions instead of absolute numbers.
Insights:
• The diagonal values represent the proportion of correctly classified instances for each class.
• Estrus, lying, and standing classes show relatively high accuracy, as their diagonal values are closer to 1.
• Grazing shows moderate confusion, particularly with “standing” (grazing is often mistaken for standing).
• The model struggles more with distinguishing background from the behaviours, as misclassifications for background are noticeable.
Action Point: Improve annotations or use data augmentation for classes with high misclassification.
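As a rough illustration of how the proportions above are obtained, the sketch below divides each row of a count matrix by its row total. It assumes rows hold the true class and columns the predicted class; some tools plot the axes the other way round, so check the axis labels of the actual figure. The counts are made up for illustration and are not taken from this model.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so entries become proportions per true class."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Guard against classes with no instances to avoid division by zero.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Hypothetical counts (rows: true class, columns: predicted class).
classes = ["grazing", "standing"]
counts = [
    [30, 10],  # true grazing: 30 predicted grazing, 10 predicted standing
    [5, 45],   # true standing: 5 predicted grazing, 45 predicted standing
]

proportions = normalize_rows(counts)
for name, row in zip(classes, proportions):
    print(name, [round(p, 2) for p in row])
```

With this convention each row sums to 1, and the diagonal entry is the share of that class that was classified correctly.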
Purpose: Displays the absolute number of correct and incorrect predictions for each class.
Insights:
• Shows the absolute number of predictions for each class.
• Highlights class imbalance in the dataset. For example, lying and standing have more data points than estrus.
• Grazing is frequently misclassified as standing, indicating overlapping visual features or poor annotation consistency.
Action Point: Add more diverse training examples for underrepresented behaviours like estrus and grazing.
Purpose: Shows the F1 score (harmonic mean of precision and recall) across different confidence thresholds.
Insights:
• The F1 score for “lying” is the highest, indicating the model is performing very well for this behaviour.
• Estrus has the lowest F1 score, suggesting the model struggles to balance precision and recall for this class.
• The overall F1 score for all classes is reasonable, but there’s room for improvement, especially for estrus and grazing.
Action Point: Focus on increasing the quality and quantity of data for estrus. Adjust hyperparameters or try different augmentation strategies to reduce confusion with similar behaviours.
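For reference, the F1 score at each confidence threshold is the harmonic mean of the precision and recall measured at that threshold. The sketch below computes it and picks the threshold with the highest F1. The threshold, precision, and recall values are hypothetical and only illustrate the calculation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(thresholds, precisions, recalls):
    """Return the (threshold, F1) pair with the highest F1 score."""
    scores = [f1_score(p, r) for p, r in zip(precisions, recalls)]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return thresholds[best_index], scores[best_index]


# Hypothetical values for one class at three confidence thresholds.
thresholds = [0.25, 0.50, 0.75]
precisions = [0.60, 0.80, 0.90]
recalls = [0.90, 0.80, 0.40]

threshold, score = best_threshold(thresholds, precisions, recalls)
print(f"best threshold={threshold}, F1={score:.2f}")
```

A class whose precision and recall cannot both be high at any single threshold will show a low peak on the F1 curve.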
Purpose: Visual representation of label distributions, including bounding box characteristics (e.g., width, height, x, y coordinates).
Insights:
• Bounding boxes are distributed in a logical way (e.g., no extreme outliers).
• However, certain areas might have clustering, indicating potential biases in the dataset or specific patterns in how the data was annotated.
• Width and height distributions are generally uniform, but overlapping boxes may need further scrutiny.
Action Point: Check for annotation biases and ensure consistent labeling practices.
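One simple way to check for spatial clustering is to count box centres per cell of a coarse grid. The sketch below assumes each box is given as normalized (x_center, y_center, width, height) with values between 0 and 1; that format and the example boxes are assumptions for illustration, not details taken from this dataset.

```python
def center_grid(boxes, bins=3):
    """Count box centres per grid cell; boxes are (x_center, y_center, w, h) in [0, 1]."""
    grid = [[0] * bins for _ in range(bins)]
    for x_center, y_center, _width, _height in boxes:
        # Clamp so a centre exactly at 1.0 falls in the last cell.
        col = min(int(x_center * bins), bins - 1)
        row = min(int(y_center * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical annotations: three boxes near the image centre, one bottom-left.
boxes = [
    (0.50, 0.50, 0.20, 0.20),
    (0.55, 0.60, 0.10, 0.30),
    (0.10, 0.90, 0.20, 0.10),
    (0.50, 0.45, 0.30, 0.20),
]

for row in center_grid(boxes):
    print(row)
```

A grid in which a few cells hold most of the counts points to the kind of clustering noted above.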
Purpose: Includes class distribution, annotation heatmaps, and bounding box patterns.
Insights:
• Highlights class imbalance, with standing and lying dominating the dataset while estrus has the fewest examples.
• Shows the spatial distribution of bounding boxes, with clustering in certain areas suggesting biases in the data collection process.
Action Point: Add more data for estrus and grazing and ensure a more even distribution of annotations to improve generalization.
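Class imbalance can be quantified before deciding how much data to add. The sketch below computes each class's share of the annotations and the ratio between the largest and smallest class. The label counts are invented for illustration and do not reflect the actual dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's fraction of all annotations."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical annotation labels, one entry per bounding box.
labels = ["standing"] * 50 + ["lying"] * 30 + ["grazing"] * 15 + ["estrus"] * 5

for name, share in class_shares(labels).items():
    print(f"{name}: {share:.0%}")
print("imbalance ratio:", imbalance_ratio(labels))
```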
Purpose: Displays how precision changes with varying confidence thresholds.
Insights:
• Shows precision values across different confidence levels for each class.
• The precision for “lying” and “standing” is strong, suggesting the model makes fewer false-positive predictions for these behaviours.
• Estrus and grazing have lower precision, indicating more false positives in these classes.
Action Point: Refine annotations and increase the dataset size for estrus and grazing. Additionally, investigate misclassifications to identify overlaps between behaviours.
Purpose: Shows the trade-off between precision and recall for each class, summarized by the area under the curve (AUC).
Insights:
• Lying and standing have high AUC values, showing an excellent precision-recall trade-off.
• Grazing and estrus have lower AUC values, suggesting difficulty in balancing precision and recall.
• The model has high recall for lying and standing but lower precision for grazing and estrus.
Action Point: Enhance data quality for grazing and estrus by collecting more diverse examples and improving annotation consistency.
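The AUC values discussed above summarize a precision-recall curve as a single number. The sketch below approximates that area with the trapezoidal rule over a few points. It is a simplified illustration: detection toolkits usually smooth or interpolate the curve first, and the points here are hypothetical.

```python
def pr_auc(recalls, precisions):
    """Trapezoidal area under precision-recall points, sorted by recall."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        # Width of the recall step times the mean precision over that step.
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical curve for one class.
recalls = [0.0, 0.5, 1.0]
precisions = [1.0, 0.8, 0.4]

print(f"AUC = {pr_auc(recalls, precisions):.2f}")
```

A curve that stays near precision 1.0 as recall grows gives an area close to 1, which is the pattern reported for lying and standing.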
Purpose: Illustrates the recall values across different confidence thresholds.
Insights:
• Recall for lying and standing is strong across confidence levels.
• Estrus has lower recall, indicating the model fails to identify many instances of this class.
• Grazing recall is moderate but drops significantly at higher confidence thresholds, showing the model struggles with confident predictions for grazing.
Action Point: Focus on increasing recall for underrepresented behaviours through balanced datasets and fine-tuning.
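To make the precision and recall curves concrete, the sketch below computes both metrics for one class at a given confidence threshold. It assumes each detection has already been marked as a true or false positive, and it returns a precision of 0.0 when no detections pass the threshold (conventions differ on this). The detections are hypothetical.

```python
def precision_recall_at(detections, num_ground_truth, threshold):
    """Precision and recall using only detections with confidence >= threshold.

    detections: list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for confidence, is_tp in detections if confidence >= threshold]
    true_positives = sum(kept)
    # Assumed convention: precision is 0.0 when nothing passes the threshold.
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical detections for one class with 4 ground-truth instances.
detections = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]

for threshold in (0.5, 0.85):
    precision, recall = precision_recall_at(detections, 4, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold discards lower-confidence detections, so recall can only stay the same or fall, which matches the drop described for grazing.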
Purpose: Tracks training and validation metrics over epochs, including losses and mAP.
Insights:
• Training Loss: Decreases steadily over epochs, indicating effective learning. The validation loss follows a similar pattern, which suggests the model is not strongly overfitting.
• Metrics: mAP (mean average precision) improves over training, with mAP50 reaching close to 0.9 and mAP50-95 around 0.6. This indicates a good balance of precision and recall overall, although the model still struggles with some challenging classes.
Action Point: To improve mAP50-95, focus on multi-scale training and incorporate hard-negative mining to handle challenging examples.
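The gap between mAP50 and mAP50-95 comes from the matching criterion: mAP50 counts a prediction as correct when its IoU with the ground truth is at least 0.50, while mAP50-95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows how a loosely localized box passes the lenient thresholds but fails the strict ones. The box coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP50-95.
IOU_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]


def thresholds_passed(predicted, truth):
    """Number of IoU thresholds at which the prediction counts as a match."""
    overlap = iou(predicted, truth)
    return sum(overlap >= threshold for threshold in IOU_THRESHOLDS)


# Hypothetical prediction shifted 2 units to the right of the ground truth.
predicted = (2, 0, 12, 10)
truth = (0, 0, 10, 10)

print(f"IoU = {iou(predicted, truth):.2f}")
print("thresholds passed:", thresholds_passed(predicted, truth), "of", len(IOU_THRESHOLDS))
```

Tighter localization raises the number of thresholds passed, so improvements in box accuracy show up in mAP50-95 even when mAP50 barely changes.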