Executive Summary

Study Overview: This analysis examines dairy cattle production data from Kenya, focusing on milk yield performance, reproductive efficiency, and data quality assessment for genetic evaluation.

Key Dataset Characteristics:

  • Milk Production Records: Multiple observations per animal across different lactations
  • Reproductive Traits: Age at first calving (AFC) and calving intervals (CI1-CI5; Section 5.2 summarizes CI1-CI3 only)
  • Herd Management: Data from multiple herds with varying management practices

1. Data Loading and Preparation


2. Dataset Overview

Dataset Overview Statistics
Metric Value
Total Animals 9,204
Total Records 646,410
Number of Herds 7
Avg Records/Animal 70.2
Date Range 2001-11-15 to 2025-04-25

Data Quality Check: Dataset successfully loaded with 646,410 milk yield records from 9,204 animals across 7 herds.


3. Milk Yield Analysis

3.1 Descriptive Statistics

Milk Yield Descriptive Statistics (litres/day unless noted)
Statistic Value
N (records) 646,410
Mean 11.547
Median 11.000
SD 6.102
Variance 37.231 (litres/day)²
CV 52.844 %
Min 0.200
Q25 7.100
Q75 15.100
Max 60.000
IQR 8.000
Skewness 0.747 (unitless)
Kurtosis 4.029 (unitless)
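
The derived rows in this table (variance, CV, IQR) follow directly from the mean, SD, and quartiles. The short R sketch below recomputes them from the rounded values printed in the table; the variable names are illustrative and not taken from the original analysis script.

```r
# Values copied from the descriptive statistics table.
mean_yield <- 11.547
sd_yield   <- 6.102
q25        <- 7.1
q75        <- 15.1

cv_percent <- 100 * sd_yield / mean_yield  # coefficient of variation, in %
iqr        <- q75 - q25                    # interquartile range, litres/day
variance   <- sd_yield^2                   # (litres/day)^2, from the rounded SD

round(cv_percent, 1)
round(iqr, 1)
round(variance, 2)
```

Because the SD is itself rounded, the recomputed variance can differ from the tabulated 37.231 in the third decimal place.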

3.2 Distribution Visualizations

3.3 Milk Yield by Herd

Milk Yield Statistics by Herd
Herd N Records Mean (litres) SD (litres)
KLBA 146293 14.20 6.98
MAKITOSHA 202285 14.19 5.16
STANLEY AND SON 96970 11.38 3.90
ADC 672 7.49 3.44
AVCD 72 7.32 2.60
KALRO NAIVASHA 177344 7.06 4.21
SAPLING 22774 6.77 4.08

4. Animal Records Summary

4.1 Top 20 Animals with Most Records

Top 20 Animals with Most Production Records
Rank Animal ID Records Mean (L) SD (L) Min (L) Max (L) Days Span
1 KALRO_2457 2360 8.95 4.69 0.2 21.2 3260
2 KALRO_2368 2268 8.42 3.84 0.2 21.2 3228
3 MAKI_367I/S 2246 14.48 4.75 2.1 31.1 3550
4 KALRO_2356 2239 7.27 4.55 0.2 19.8 3381
5 MAKI_371I/S 2232 16.16 5.63 1.5 34.5 3078
6 KALRO_2422 2189 9.05 3.76 0.2 20.0 3068
7 MAKI_364I/S 2124 19.91 6.66 2.0 39.5 3523
8 KALRO_2475 2119 8.71 3.73 1.0 20.0 2999
9 MAKI_106I/S 2112 19.39 6.67 1.9 36.5 4190
10 KALRO_2467 2040 8.59 3.81 0.2 19.8 2995
11 KALRO_2450 2034 7.49 4.33 0.2 19.6 3298
12 MAKI_107I/S 2028 19.02 6.09 1.0 37.5 3883
13 KALRO_2424 2027 7.73 4.06 0.6 19.2 3433
14 MAKI_377I/S 2009 13.13 5.62 0.9 30.6 3484
15 MAKI_346I/S 1985 15.14 4.91 1.6 27.8 2360
16 KALRO_2416 1949 7.37 3.82 0.2 19.4 3221
17 KALRO_0001 1937 7.13 3.95 0.4 20.0 2614
18 MAKI_341I/S 1929 15.79 5.13 4.3 31.8 3793
19 MAKI_249I/S 1917 17.73 5.16 4.5 31.4 2894
20 KALRO_2429 1892 9.42 5.00 0.6 22.4 3175

Most-Recorded Animal: KALRO_2457 has the most records, with 2,360 observations spanning 3,260 days and a mean yield of 8.95 litres/day, which is below the overall mean of 11.55 litres/day.

4.2 Animals with Fewest Records

20 Animals with Fewest Production Records (the first 20 of the 534 single-record animals, listed by ID)
Rank Animal ID Records Mean (L) SD (L) Min (L) Max (L) Days Span
1 10077_8743 1 2.0 NA 2.0 2.0 0
2 108_5404 1 13.5 NA 13.5 13.5 0
3 10_6 1 11.0 NA 11.0 11.0 0
4 116_06DFN02 1 8.0 NA 8.0 8.0 0
5 11_101801 1 6.0 NA 6.0 6.0 0
6 11_1100 1 17.0 NA 17.0 17.0 0
7 11_1ZR15 1 9.5 NA 9.5 9.5 0
8 11_1ZR32 1 6.2 NA 6.2 6.2 0
9 11_1ZR41 1 9.3 NA 9.3 9.3 0
10 11_2ZR60 1 12.0 NA 12.0 12.0 0
11 11_2ZR71 1 30.0 NA 30.0 30.0 0
12 11_2ZR74 1 7.5 NA 7.5 7.5 0
13 11_2ZR75 1 10.0 NA 10.0 10.0 0
14 11_2ZR80 1 8.0 NA 8.0 8.0 0
15 11_302202 1 11.5 NA 11.5 11.5 0
16 11_303103 1 15.0 NA 15.0 15.0 0
17 11_3ZR1 1 27.0 NA 27.0 27.0 0
18 11_3ZR10/13 1 21.0 NA 21.0 21.0 0
19 11_3ZR19 1 11.0 NA 11.0 11.0 0
20 11_3ZR2 1 10.0 NA 10.0 10.0 0

⚠️ Data Quality Concern: 534 animals (5.8%) have only a single record, which limits their value for repeated measures analysis and breeding value estimation.

4.3 Distribution of Records per Animal

Summary: Animals Grouped by Record Frequency
Record Category Number of Animals Percentage (%)
1 Record 534 5.8
2-3 Records 545 5.9
4-5 Records 497 5.4
6-10 Records 1157 12.6
>10 Records 6471 70.3
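
The grouping above can be reproduced by binning each animal's record count. A minimal sketch in base R, assuming a vector of per-animal record counts (the `n_records` values below are made up for illustration; in practice they might come from something like `table(milk$animal_id)`, where `milk` and `animal_id` are hypothetical names):

```r
# Hypothetical per-animal record counts (not real data).
n_records <- c(1, 1, 2, 3, 5, 8, 10, 11, 70, 2360)

# Right-closed bins: (0,1], (1,3], (3,5], (5,10], (10,Inf]
category <- cut(
  n_records,
  breaks = c(0, 1, 3, 5, 10, Inf),
  labels = c("1 Record", "2-3 Records", "4-5 Records",
             "6-10 Records", ">10 Records")
)

counts <- table(category)
pct    <- round(100 * prop.table(counts), 1)

data.frame(
  category = names(counts),
  animals  = as.vector(counts),
  percent  = as.vector(pct)
)
```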

5. Reproductive Performance

5.1 Age at First Calving (AFC)

Age at First Calving Statistics (months unless noted)
Statistic Value
N (animals) 9,204
Mean 35.98
Median 34.00
SD 8.87
Min 21.00
Max 60.00
CV 24.64 %

5.2 Calving Intervals

Calving Interval Statistics
Interval N Mean (days) SD (days) Min (days) Max (days)
CI1 3880 465.6 117.9 251 800
CI2 2232 456.0 112.2 250 799
CI3 1273 444.7 105.4 259 796
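
For reference, a calving interval is the number of days between consecutive calvings of the same cow, so CI1 is the gap between the first and second calving. The sketch below shows one way to derive the intervals in base R; the dates are invented and the report's own derivation may differ.

```r
# Hypothetical calving dates for one cow, ordered by parity.
calving_dates <- as.Date(c("2015-03-01", "2016-06-08",
                           "2017-08-30", "2018-11-20"))

# Interval k is the number of days between calving k and calving k + 1.
ci_days <- as.numeric(diff(calving_dates))
names(ci_days) <- paste0("CI", seq_along(ci_days))

ci_days
```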


6. Data Quality Assessment

⚠️ CRITICAL FINDING: Preliminary genetic analysis revealed an estimated heritability of 0.05 for milk yield, which is substantially lower than expected values (typically 0.20-0.35 for dairy cattle).

6.1 Heritability Interpretation

Heritability Estimate Comparison
Source Heritability Interpretation
Expected Range (Literature) 0.20 - 0.35 Normal genetic variation
Kenya Data (Current Estimate) 0.05 Very low - Data quality issues likely
Difference ▼ 0.15 - 0.30 Substantial underestimation

What Does Low Heritability Indicate?

A heritability estimate of 0.05 suggests that only 5% of the phenotypic variation in milk yield can be attributed to additive genetic effects. This unusually low value indicates:

  1. High Environmental Noise: Excessive variation from non-genetic factors (management, nutrition, health)
  2. Data Quality Issues: Potential problems with data recording, consistency, or completeness
  3. Model Misspecification: Fixed effects may not adequately account for systematic environmental variation
  4. Limited Genetic Variation: Possible inbreeding or lack of genetic diversity (less likely)

6.3 Outlier Detection

Outlier Analysis Summary
Metric Value
Total records 646,410
Outliers beyond ±3 SD 5,236
Percent outliers 0.81 %
Mean with outliers (litres/day) 11.55
Mean without outliers (litres/day) 11.37
SD with outliers (litres/day) 6.10
SD without outliers (litres/day) 5.78
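
A ±3 SD rule flags any record whose distance from the mean exceeds three standard deviations. The sketch below is one possible base R implementation, assuming the rule was applied to the overall mean and SD rather than within herd or parity (the report does not say which); the function name and toy data are illustrative.

```r
# Flag values more than k standard deviations from the mean.
flag_outliers_3sd <- function(x, k = 3) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  abs(x - m) > k * s
}

# Toy data: fifty ordinary yields and one extreme value.
x <- c(rep(10, 50), 60)
sum(flag_outliers_3sd(x))

# Bounds implied by the reported overall mean (11.547) and SD (6.102).
lower <- 11.547 - 3 * 6.102
upper <- 11.547 + 3 * 6.102
c(lower = lower, upper = upper)
```

Under that assumption the lower bound is negative, below the recorded minimum of 0.2 litres/day. All flagged records would therefore be high yields, which is consistent with the mean falling once they are excluded.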


7. Conclusions and Recommendations

Key Findings

  1. Dataset Size: The dataset contains 646,410 milk yield records from 9,204 animals across 7 herds

  2. Milk Yield Performance:

    • Mean: 11.55 litres/day
    • Standard Deviation: 6.1 litres
    • Coefficient of Variation: 52.8%
  3. Reproductive Performance:

    • Mean AFC: 36 months
    • Mean CI1: 465.6 days
  4. Critical Issue: Heritability = 0.05 (Expected: 0.20-0.35)

⚠️ ACTION REQUIRED

The very low heritability estimate (h² = 0.05) suggests that data filtering and quality control are needed before proceeding with genetic evaluation. The most likely explanation is that environmental noise and unmodelled systematic effects are masking genetic variation, although the possible causes listed in Section 6.1 have not yet been separated.

Next Steps:

Data Cleaning and Filtering:

  • Remove outliers beyond ±3 SD (5,236 records identified)
  • Standardize records by days in milk (DIM 5-305)
  • Apply lactation stage corrections
  • Remove herds with <10 animals or high CV (>40%)
  • Remove animals with single records for repeatability analysis
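
As an illustration of the DIM restriction, days in milk can be taken as the number of days between the calving date and the test-day record, keeping records with DIM from 5 to 305. This is a minimal sketch on invented data; the column names are hypothetical, and counting the calving day as day 0 is an assumption that may not match the convention used in the source data.

```r
# Hypothetical test-day records (column names are illustrative only).
milk <- data.frame(
  animal_id    = c("A1", "A1", "A1", "A2"),
  calving_date = as.Date(c("2020-01-01", "2020-01-01",
                           "2020-01-01", "2021-02-01")),
  record_date  = as.Date(c("2020-01-03", "2020-03-01",
                           "2020-12-01", "2021-02-06")),
  yield        = c(6.0, 14.5, 5.2, 9.8)
)

# Assumption: calving day counts as DIM 0.
milk$dim <- as.numeric(milk$record_date - milk$calving_date)

# Keep only records inside the DIM 5-305 window.
in_window <- milk$dim >= 5 & milk$dim <= 305
filtered  <- milk[in_window, ]

filtered
```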

Model Refinement:

  • Include year-season effects
  • Add age at recording as covariate
  • Consider herd-year-season contemporary groups
  • Implement proper repeatability model for repeated measures
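
A herd-year-season contemporary group can be built by combining the herd label with the year and season of each record. The sketch below uses calendar quarters as seasons purely for illustration; that definition is an assumption, as are the column names, and a real analysis would probably use locally meaningful wet and dry season boundaries.

```r
# Hypothetical test-day records (column names are illustrative only).
milk <- data.frame(
  herd        = c("KLBA", "KLBA", "MAKITOSHA", "MAKITOSHA"),
  record_date = as.Date(c("2020-01-15", "2020-07-03",
                          "2020-01-20", "2021-11-09"))
)

month <- as.integer(format(milk$record_date, "%m"))
year  <- format(milk$record_date, "%Y")

# Assumption: four calendar-quarter seasons (Q1 = Jan-Mar, ..., Q4 = Oct-Dec).
season <- paste0("Q", (month - 1) %/% 3 + 1)

# One factor level per herd-year-season combination.
milk$hys <- factor(paste(milk$herd, year, season, sep = "_"))

table(milk$hys)
```

The resulting `hys` factor could then enter the genetic model as a fixed effect, after checking that each group holds enough records to be estimable.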

Validation and Testing:

  • Re-estimate genetic parameters with cleaned data
  • Test multiple genetic models and compare fit statistics
  • Cross-validate parameter estimates using independent data subsets
  • Document all filtering decisions and their impacts

Expected Outcomes After Implementation:

  • Heritability improvement from 0.05 to 0.15-0.25 (realistic range)
  • Better breeding value accuracy and reliability
  • Reduced residual variance and improved model fit


8. Technical Appendix

8.1 Data Distribution by Parity

Milk Yield Distribution by Parity
Parity Records Animals Mean Yield (litres) SD (litres) % of Total
1 284373 9035 10.42 5.29 43.99
2 151668 4241 11.86 6.41 23.46
3 92835 2427 12.48 6.65 14.36
4 52135 1398 12.89 6.82 8.07
5 30886 771 13.10 6.45 4.78
6 19459 421 13.52 6.32 3.01
7 9434 226 13.13 5.87 1.46
8 3668 110 14.34 6.06 0.57
9 1556 47 13.78 5.97 0.24
10 276 14 14.66 6.58 0.04
11 106 5 13.12 4.16 0.02
12 14 1 13.06 3.35 0.00

8.3 Herd Size and Performance

Herd Performance Summary
Herd Animals Records Mean (litres) SD (litres) CV (%)
KLBA 6347 146293 14.20 6.98 49.13
SAPLING 1309 22774 6.77 4.08 60.22
STANLEY AND SON 755 96970 11.38 3.90 34.29
MAKITOSHA 488 202285 14.19 5.16 36.33
KALRO NAIVASHA 273 177344 7.06 4.21 59.64
ADC 24 672 7.49 3.44 45.87
AVCD 8 72 7.32 2.60 35.51
Note: Herds with <10 animals were highlighted in red in the original table (here only AVCD, with 8 animals); consider these for removal.
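
As a rough check of what the herd filter proposed in this report (fewer than 10 animals, or CV above 40%) would do, the values from the table above can be run through it. This is only a sketch on the summary figures, not the report's actual filtering code.

```r
# Herd summary copied from the Herd Performance Summary table.
herds <- data.frame(
  herd    = c("KLBA", "SAPLING", "STANLEY AND SON", "MAKITOSHA",
              "KALRO NAIVASHA", "ADC", "AVCD"),
  animals = c(6347, 1309, 755, 488, 273, 24, 8),
  records = c(146293, 22774, 96970, 202285, 177344, 672, 72),
  cv      = c(49.13, 60.22, 34.29, 36.33, 59.64, 45.87, 35.51)
)

herds$drop_small   <- herds$animals < 10   # fewer than 10 animals
herds$drop_high_cv <- herds$cv > 40        # CV above 40%
herds$keep         <- !(herds$drop_small | herds$drop_high_cv)

kept <- herds[herds$keep, ]
kept$herd
sum(kept$animals)
sum(kept$records)
```

With both thresholds applied as stated, only STANLEY AND SON and MAKITOSHA would remain (1,243 of 9,204 animals and 299,255 of 646,410 records), so the CV cut-off may deserve a second look before it is applied.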

8.4 Coefficient of Variation by Herd


9. Statistical Summary Tables

9.1 Variance Components (Current Estimate)

Variance Components Breakdown (Current Dataset)
Variance Component Value % of Total Status
Additive Genetic (σ²a) 1.86 5.0% ❌ Too Low
Residual (σ²e) 35.37 95.0% ❌ Too High
Phenotypic (σ²p) 37.23 100.0%
Heritability (h²) 0.05 ❌ Below Expected
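
The heritability in this table is the ratio of additive genetic variance to phenotypic variance, h² = σ²a / σ²p, where σ²p is here the sum of the additive and residual components. A small R check using the tabulated values (variable names are illustrative):

```r
# Variance components copied from the table above.
sigma2_a <- 1.86    # additive genetic variance
sigma2_e <- 35.37   # residual variance

sigma2_p <- sigma2_a + sigma2_e   # phenotypic variance
h2       <- sigma2_a / sigma2_p   # narrow-sense heritability

round(sigma2_p, 2)
round(h2, 3)
```

Since no permanent environmental component is listed in this table, any such variance would be absorbed into the residual term in this breakdown.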

9.2 Expected vs. Observed Comparison

Expected vs. Observed Genetic Parameters
Parameter Expected Range Observed Gap
Heritability (h²) 0.20 - 0.35 0.05 ⬇️ 75-86% lower
Genetic Variance 20 - 35% of σ²p 5% of σ²p ⬇️ 15-30 percentage points lower
Residual Variance 40 - 50% of σ²p 95% of σ²p ⬆️ 45-55 percentage points higher
Repeatability 0.50 - 0.60 Not estimated
Breeding Value Accuracy 0.70 - 0.80 <0.40 (estimated) ⬇️ 50% lower

10. Quality Control Metrics

Data Quality Control Summary
Quality Metric Count/Value Percentage Status
Records with complete data 646410 100% ✓ Good
Animals with ≥2 records 8670 94.2% ✓ Good
Herds with ≥10 animals 6 85.7% ✓ Good
Records within DIM 5-305 Not calculated ❌ Required
Records within 3 SD 641174 99.2% ✓ Good
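Records with calving date 646410 100% ✓ Good
Animals with pedigree info 162206 ✓ Available (count exceeds the 9,204 animals with records; presumably it covers every individual in the pedigree file)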

11. Final Recommendations Summary

Prioritized Action Plan

🔴 CRITICAL PRIORITY

  1. Remove Outliers: Eliminate 5236 records beyond ±3 SD
  2. DIM Standardization: Calculate and apply 305-day lactation adjustments
  3. Preliminary Re-analysis: Re-estimate heritability with cleaned data

🟡 HIGH PRIORITY

  1. Herd Filtering: Remove herds with <10 animals or high CV (>40%)
  2. Contemporary Groups: Create herd-year-season groups for better environmental accounting
  3. Repeatability Model: Implement proper repeated measures analysis
  4. Remove Single-Record Animals: Exclude 534 animals with only 1 observation

🟢 MEDIUM PRIORITY

  1. Model Optimization: Test multiple genetic models and compare fit statistics
  2. Cross-validation: Validate parameter estimates using independent data subsets
  3. Documentation: Create detailed data quality and filtering report
  4. DIM Range Filtering: Restrict analysis to DIM 5-305 days
  5. Parity Standardization: Consider analyzing first three parities separately

📊 Expected Progression

Stage Heritability Estimate Status
Current (Unfiltered) 0.05 ± 0.02 ❌ Unacceptable
After Basic Filtering 0.12 - 0.18 ⚠️ Improving
After Full Optimization 0.20 - 0.30 ✓ Acceptable

12. Session Information

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_Kenya.utf8  LC_CTYPE=English_Kenya.utf8   
## [3] LC_MONETARY=English_Kenya.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_Kenya.utf8    
## 
## time zone: Africa/Nairobi
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] plotly_4.11.0    gridExtra_2.3    scales_1.4.0     kableExtra_1.4.0
##  [5] knitr_1.50       moments_0.14.1   ggplot2_3.5.2    tidyr_1.3.1     
##  [9] dplyr_1.1.4      readxl_1.4.5    
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.4.0         lattice_0.22-7    
##  [5] stringi_1.8.7      digest_0.6.37      magrittr_2.0.3     evaluate_1.0.5    
##  [9] grid_4.5.1         RColorBrewer_1.1-3 fastmap_1.2.0      Matrix_1.7-3      
## [13] cellranger_1.1.0   jsonlite_2.0.0     mgcv_1.9-3         httr_1.4.7        
## [17] purrr_1.1.0        viridisLite_0.4.2  lazyeval_0.2.2     textshaping_1.0.2 
## [21] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        splines_4.5.1     
## [25] withr_3.0.2        cachem_1.1.0       yaml_2.3.10        tools_4.5.1       
## [29] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4    stringr_1.5.1     
## [33] htmlwidgets_1.6.4  pkgconfig_2.0.3    pillar_1.11.0      bslib_0.9.0       
## [37] gtable_0.3.6       glue_1.8.0         data.table_1.17.8  systemfonts_1.2.3 
## [41] xfun_0.52          tibble_3.3.0       tidyselect_1.2.1   rstudioapi_0.17.1 
## [45] farver_2.1.2       nlme_3.1-168       htmltools_0.5.8.1  rmarkdown_2.29    
## [49] svglite_2.2.1      labeling_0.4.3     compiler_4.5.1

Report Generated: 2025-10-02 11:55:20.437079

Contact: Dairy Genetics Analysis Team

Note: This report must be reviewed before proceeding with genetic evaluation. The low heritability estimate requires immediate attention through data filtering and quality control measures.