Phenotypic Analysis Complete
Phenotypic Analysis of Dairy Cattle Traits in Kenya
ADGG/TDGG Project — From Unstructured Field Data to Analysis-Ready Datasets

Author: Setegn Worku Alemu  •  Generated: February 22, 2026  •  Traits: AFC DMY TLY CI


1 Project Pipeline

1
Raw Data Cleaning
3 CSV sources merged & harmonised
2
QC Filtering
Biological limits applied
3
Descriptive Analysis
4 traits characterised
4
Fixed Effects Models
OLS & ANOVA complete
5
Genetic Parameters
BLUPF90 — next step
6
EBV Estimation
Bull & cow ranking
Current Status

Steps 1–4 are complete and documented in this report. The data are now structured, cleaned, and characterised — ready for variance component estimation (Step 5) using BLUPF90.


2 Introduction

2.1 Background

The ADGG/TDGG projects (Bill & Melinda Gates Foundation / ILRI) aim to accelerate genetic gains in smallholder dairy systems in East Africa. Kenyan farms predominantly keep crossbred cows (local × exotic) that balance tropical adaptation with improved milk potential.

Accurate genetic evaluations require a well-characterised phenotypic foundation. This report documents the complete pipeline from raw field data to analysis-ready trait datasets for four key traits:

Abbreviation Full name Unit
AFC Age at First Calving months
DMY Daily Milk Yield (test-day) kg per day
TLY Total Lactation Yield kg per lactation
CI Calving Interval days

2.2 Objectives

  1. Merge and harmonise three raw data files into trait-specific datasets.
  2. Apply quality-control filters with biologically justified thresholds.
  3. Characterise all four traits (descriptive statistics, distributions).
  4. Quantify systematic environmental effects (Region, Breed, Parity, DIM) via linear models.
  5. Confirm sufficient residual variance for genetic evaluation.

3 Data Loading

Three Input Files
  • pedigree-extract.csv — animal identity, parentage, breed, farm location.
  • lactation_calving_improved(in).csv — one row per calving: AFC, TLY, CI.
  • milking_data_improved.csv — one row per test-day visit: daily milk, DIM. Authoritative source for breed name labels.

3.1 File Paths

data_path <- "C:/Users/SAlemu/OneDrive - CGIAR/Documents/KenyaData/New_Kenya_data/Version2/"
output_path <- paste0(data_path, "BLUPF90CL/")
if (!dir.exists(output_path)) dir.create(output_path, recursive = TRUE)
cat("Data path :", data_path, "\nOutput path:", output_path, "\n")
Data path : C:/Users/SAlemu/OneDrive - CGIAR/Documents/KenyaData/New_Kenya_data/Version2/ 
Output path: C:/Users/SAlemu/OneDrive - CGIAR/Documents/KenyaData/New_Kenya_data/Version2/BLUPF90CL/ 

3.2 Read Data Files

pedigree <- fread(paste0(data_path, "pedigree-extract.csv"),
  na.strings = c("NULL","null","NA","N/A",""," "), stringsAsFactors = FALSE)
lactation <- fread(paste0(data_path, "lactation_calving_improved(in).csv"),
  na.strings = c("NULL","null","NA","N/A",""," "), stringsAsFactors = FALSE)
milking <- fread(paste0(data_path, "milking_data_improved.csv"),
  na.strings = c("NULL","null","NA","N/A",""," "), stringsAsFactors = FALSE)

cat("Records loaded:\n")
Records loaded:
cat("  Pedigree  :", format(nrow(pedigree),  big.mark=","), "rows,", ncol(pedigree),  "columns\n")
  Pedigree  : 248,208 rows, 29 columns
cat("  Lactation :", format(nrow(lactation), big.mark=","), "rows,", ncol(lactation), "columns\n")
  Lactation : 88,352 rows, 34 columns
cat("  Milking   :", format(nrow(milking),   big.mark=","), "rows,", ncol(milking),   "columns\n")
  Milking   : 1,048,575 rows, 36 columns

3.3 Data Structure

head(pedigree, 8) %>%
  select(animal_id, main_breed, region, district, farm_id, birthdate, sex) %>%
  kable(caption = "Sample pedigree records") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Sample pedigree records
animal_id main_breed region district farm_id birthdate sex
170285 3 Migori Kuria West 515160 7/27/2016 FEMALE
170298 3 Kakamega Likuyani 515626 6/15/2016 FEMALE
170315 3 Kakamega Lugari 598316 5/18/2018 FEMALE
170328 3 Kakamega Likuyani 595412 05/08/2018 FEMALE
170506 26 Homa Bay Ndhiwa 514950 05/04/2017 FEMALE
170513 23 Homa Bay Ndhiwa 515024 5/29/2019 FEMALE
170515 15 Migori Kuria West 510092 01/07/2018 FEMALE
170554 9 Kakamega Likuyani 616901 05/12/2017 FEMALE
head(lactation, 8) %>%
  select(animal_id, parity, calving_date, afc_months, calving_interval_days, total_lactation_yield) %>%
  kable(caption = "Sample lactation records (one row = one calving event)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Sample lactation records (one row = one calving event)
animal_id parity calving_date afc_months calving_interval_days total_lactation_yield
170285 1 11/23/1993 -273 NA 11.1
170285 2 10/9/1994 -273 320 35.4
170285 3 8/4/1996 -273 665 NA
170285 4 4/7/1997 -273 246 5.3
170285 5 9/9/1997 -273 155 10.4
170285 6 12/19/1998 -273 466 25.0
170285 7 1/1/2005 -273 2205 8.9
170285 8 10/13/2007 -273 1015 5.0
head(milking, 8) %>%
  select(animal_id, parity, milking_date, dim, milk_morning, milk_evening, daily_yield) %>%
  kable(caption = "Sample test-day records (one row = one farm visit)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Sample test-day records (one row = one farm visit)
animal_id parity milking_date dim milk_morning milk_evening daily_yield
170285 1 15/12/1993 22 1.6 0 1.6
170285 1 15/01/1994 53 1.6 0 1.6
170285 1 15/02/1994 84 1.2 0 1.2
170285 1 15/03/1994 112 1.2 0 1.2
170285 1 15/04/1994 143 1.7 0 1.7
170285 1 15/05/1994 173 0.7 0 0.7
170285 1 15/06/1994 204 1.5 0 1.5
170285 1 15/07/1994 234 1.6 0 1.6

4 Geographic and Breed Structure

4.1 Building the Location Lookup

pedigree_factors <- pedigree %>%
  mutate(animal_id = as.character(animal_id)) %>%
  select(animal_id, country, region, district, ward, village, farm_id, main_breed) %>%
  group_by(animal_id) %>% slice(1) %>% ungroup()

breed_lookup <- milking %>%
  filter(!is.na(main_breed), !is.na(main_breed_name),
         trimws(as.character(main_breed_name)) != "") %>%
  mutate(main_breed = as.integer(main_breed),
         main_breed_name = trimws(as.character(main_breed_name))) %>%
  distinct(main_breed, main_breed_name)

pedigree_factors <- pedigree_factors %>%
  mutate(main_breed = as.integer(main_breed)) %>%
  left_join(breed_lookup, by = "main_breed") %>%
  mutate(main_breed_name = ifelse(is.na(main_breed_name),
                                  paste0("Breed_", main_breed), main_breed_name))

region_levels   <- pedigree_factors %>% filter(!is.na(region))  %>% distinct(region)  %>% mutate(region_code = row_number())
district_levels <- pedigree_factors %>% filter(!is.na(district)) %>% distinct(district) %>% mutate(district_code = row_number())
farm_levels     <- pedigree_factors %>% filter(!is.na(farm_id))  %>% distinct(farm_id)  %>% mutate(farm_code = row_number())
breed_levels    <- pedigree_factors %>% filter(!is.na(main_breed)) %>% distinct(main_breed, main_breed_name) %>% mutate(breed_code = row_number())

pedigree_factors_coded <- pedigree_factors %>%
  left_join(region_levels, by="region") %>% left_join(district_levels, by="district") %>%
  left_join(farm_levels, by="farm_id") %>% left_join(breed_levels, by=c("main_breed","main_breed_name"))

data.frame(
  Factor = c("Countries","Regions","Districts","Wards","Villages","Farms/Herds","Breeds"),
  `Number of levels` = c(n_distinct(na.omit(pedigree_factors$country)),
    nrow(region_levels), nrow(district_levels),
    n_distinct(na.omit(pedigree_factors$ward)),
    n_distinct(na.omit(pedigree_factors$village)),
    nrow(farm_levels), nrow(breed_levels)), check.names = FALSE
) %>%
  kable(caption = "Number of unique levels for each factor", align = c("l","r")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE)
Number of unique levels for each factor
Factor Number of levels
Countries 2
Regions 45
Districts 211
Wards 505
Villages 1050
Farms/Herds 28018
Breeds 40

4.2 Geographic Hierarchy

pedigree_factors %>%
  filter(!is.na(region), !is.na(district), !is.na(farm_id)) %>%
  group_by(region) %>%
  summarise(Districts = n_distinct(district), Wards = n_distinct(ward),
            Villages = n_distinct(village), Farms = n_distinct(farm_id),
            Animals = n_distinct(animal_id), .groups = "drop") %>%
  arrange(desc(Animals)) %>%
  kable(caption = "Geographic hierarchy: sub-units per region",
        format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white")
Geographic hierarchy: sub-units per region
region Districts Wards Villages Farms Animals
Nakuru 11 25 21 1,187 27,146
Kiambu 10 14 1 450 7,685
Laikipia 2 4 1 140 7,363
Muranga 7 8 1 1,501 6,011
Nyandarua 7 18 12 1,053 5,630
Makueni 7 33 180 3,256 4,965
Kisumu 7 37 115 1,364 4,160
Uasin Gishu 6 10 2 586 4,160
Narok 5 9 12 95 3,318
Nyeri 5 6 1 459 3,097
Kakamega 11 34 83 1,168 3,094
Nandi County 5 8 9 549 2,484
Siaya 6 28 123 1,033 2,374
Meru 5 5 1 395 1,641
Homa Bay 8 36 105 835 1,571
Trans Nzoia 4 5 1 35 1,570
Taita Taveta 5 12 20 174 1,566
Kilifi 2 3 2 5 1,551
Busia 7 33 76 701 1,499
Migori 9 27 63 819 1,488
Kajiado 4 8 11 82 1,355
Nairobi 9 8 2 28 1,325
Baringo 5 7 1 99 1,234
Tharaka Nithi 2 3 1 197 973
Vihiga 5 23 60 419 904
Kitui 8 22 59 504 816
Bomet 5 6 1 106 699
Kirinyaga 3 3 1 78 684
Nyamira 4 6 1 193 656
Machakos 5 8 1 102 614
Bungoma 8 38 59 344 585
Kisii 9 34 63 163 394
Embu 2 1 1 63 344
Kericho 4 9 1 49 294
West Pokot 1 1 1 8 112
Marsabit 1 1 1 13 57
Kwale 1 1 1 1 44
Oromia 1 1 1 1 18
Elgeyo Marakwet 3 3 1 4 13
Amhara 2 3 3 3 5
Osun 1 1 1 1 1

Note: The hierarchy is strictly nested. We use Region + Farm to represent geography without collinearity.

4.3 Breed Distribution

breed_dist <- pedigree_factors %>%
  filter(!is.na(main_breed_name)) %>% count(main_breed_name, name = "Count") %>%
  arrange(desc(Count)) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1),
         main_breed_name = factor(main_breed_name, levels = main_breed_name))

ggplot(breed_dist, aes(x = main_breed_name, y = Count, fill = main_breed_name)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5, size = 3.5) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Distribution of Animals by Breed",
       subtitle = paste("Total animals:", format(sum(breed_dist$Count), big.mark = ",")),
       x = "Breed", y = "Number of Animals") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

breed_dist %>% rename(Breed = main_breed_name, `N Animals` = Count, `%` = Percentage) %>%
  kable(caption = "Breed composition", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Breed composition
Breed N Animals %
Holstein Friesian 109,158 41.4
Ayrshire 32,484 12.3
Sahiwal 20,880 7.9
Boran 18,976 7.2
FFFB 15,203 5.8
FFFF 15,203 5.8
Breed_0 12,611 4.8
Jersey 8,631 3.3
Guernsey 6,293 2.4
Holstein-Friesian crosses 3,726 1.4
Unknown 3,484 1.3
Others 3,341 1.3
Breed_17 2,824 1.1
Local Zebu 2,169 0.8
Sahiwal crosses 1,143 0.4
Breed_43 1,010 0.4
Brown Swiss 907 0.3
Gir 707 0.3
Breed_10 650 0.2
Breed_80 634 0.2
Zebu 498 0.2
Red Poll 443 0.2
Breed_35 434 0.2
Ayrshire crosses 381 0.1
Fleckvieh 337 0.1
Hereford 258 0.1
Breed_87 140 0.1
Breed_36 136 0.1
Breed_1 125 0.0
Jersey crosses 116 0.0
Magic 89 0.0
Breed_38 87 0.0
Breed_85 72 0.0
Girolando 66 0.0
Breed_86 64 0.0
Scandinavian Red 53 0.0
Guernsey crosses 35 0.0
Breed_12 28 0.0
Breed_77 6 0.0
Breed_16 5 0.0
Breed_72 2 0.0
Breed_74 2 0.0

5 Quality-Control Filtering & Trait Datasets

5.1 Filtering Rules

Trait Minimum Maximum Biological Rationale
AFC (months) 18 72 Puberty ≥ 18 mo; > 72 mo = recording error
Daily yield (kg) 0.5 60 < 0.5 = missed milking; > 60 implausible
DIM (days) 5 500 Exclude colostrum; > 500 = extended lactation
TLY (kg) 100 20,000 < 100 = incomplete; > 20k implausible
CI (days) 280 900 Min = gestation; max ≈ 2.5 years

5.2 Build Trait Datasets

ped_slim <- pedigree_factors_coded %>%
  select(animal_id, region, region_code, farm_id, farm_code,
         main_breed, main_breed_name, breed_code)

afc_data <- lactation %>%
  mutate(animal_id = as.character(animal_id), afc_months = as.numeric(afc_months),
         parity = as.numeric(parity)) %>%
  filter(parity == 1, !is.na(afc_months), afc_months >= 18, afc_months <= 72) %>%
  select(-any_of(c("region","farm_id","main_breed","main_breed_name"))) %>%
  left_join(ped_slim, by = "animal_id") %>%
  filter(!is.na(region), !is.na(farm_id), !is.na(main_breed_name)) %>%
  group_by(animal_id) %>% slice(1) %>% ungroup() %>%
  mutate(region = factor(region), farm_id = factor(farm_id),
         main_breed = factor(main_breed), main_breed_name = factor(main_breed_name))

daily_milk_data <- milking %>%
  mutate(animal_id = as.character(animal_id), daily_yield = as.numeric(daily_yield),
         dim = as.numeric(dim), parity = as.numeric(parity)) %>%
  filter(!is.na(daily_yield), daily_yield > 0.5, daily_yield < 60,
         !is.na(dim), dim >= 5, dim <= 500) %>%
  select(-any_of(c("region","farm_id","main_breed","main_breed_name"))) %>%
  left_join(ped_slim, by = "animal_id") %>%
  filter(!is.na(region), !is.na(farm_id), !is.na(main_breed_name)) %>%
  mutate(parity_class = factor(ifelse(is.na(parity)|parity<1, 1, ifelse(parity>4, 4, parity))),
         dim_class = factor(ifelse(ceiling(dim/30) > 17, 17, ceiling(dim/30))),
         region = factor(region), farm_id = factor(farm_id),
         main_breed = factor(main_breed), main_breed_name = factor(main_breed_name))

tly_data <- lactation %>%
  mutate(animal_id = as.character(animal_id),
         total_lactation_yield = as.numeric(total_lactation_yield),
         parity = as.numeric(parity)) %>%
  filter(!is.na(total_lactation_yield), total_lactation_yield >= 100,
         total_lactation_yield <= 20000) %>%
  select(-any_of(c("region","farm_id","main_breed","main_breed_name"))) %>%
  left_join(ped_slim, by = "animal_id") %>%
  filter(!is.na(region), !is.na(farm_id), !is.na(main_breed_name)) %>%
  mutate(parity_class = factor(ifelse(is.na(parity)|parity<1, 1, ifelse(parity>4, 4, parity))),
         region = factor(region), farm_id = factor(farm_id),
         main_breed = factor(main_breed), main_breed_name = factor(main_breed_name))

ci_data <- lactation %>%
  mutate(animal_id = as.character(animal_id),
         calving_interval_days = as.numeric(calving_interval_days),
         parity = as.numeric(parity)) %>%
  filter(parity > 1, !is.na(calving_interval_days),
         calving_interval_days >= 280, calving_interval_days <= 900) %>%
  select(-any_of(c("region","farm_id","main_breed","main_breed_name"))) %>%
  left_join(ped_slim, by = "animal_id") %>%
  filter(!is.na(region), !is.na(farm_id), !is.na(main_breed_name)) %>%
  mutate(parity_class = factor(ifelse(parity > 4, 4, parity)),
         region = factor(region), farm_id = factor(farm_id),
         main_breed = factor(main_breed), main_breed_name = factor(main_breed_name))

data.frame(
  Trait = c("Age at First Calving","Daily Milk Yield","Total Lactation Yield","Calving Interval"),
  Abbrev. = c("AFC","DMY","TLY","CI"),
  Records = c(nrow(afc_data), nrow(daily_milk_data), nrow(tly_data), nrow(ci_data)),
  Animals = c(n_distinct(afc_data$animal_id), n_distinct(daily_milk_data$animal_id),
              n_distinct(tly_data$animal_id), n_distinct(ci_data$animal_id)),
  Farms = c(n_distinct(afc_data$farm_id), n_distinct(daily_milk_data$farm_id),
            n_distinct(tly_data$farm_id), n_distinct(ci_data$farm_id)),
  Regions = c(n_distinct(afc_data$region), n_distinct(daily_milk_data$region),
              n_distinct(tly_data$region), n_distinct(ci_data$region))
) %>%
  kable(caption = "Data after quality-control filtering",
        format.args = list(big.mark = ","), align = c("l","c","r","r","r","r")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white")
Data after quality-control filtering
Trait Abbrev. Records Animals Farms Regions
Age at First Calving AFC 27,669 27,669 1,247 35
Daily Milk Yield DMY 584,617 15,478 651 35
Total Lactation Yield TLY 9,573 4,941 356 22
Calving Interval CI 44,250 17,635 453 32

6 Descriptive Statistics

6.1 Overall Summary

summary_stats <- data.frame(
  Trait = c("AFC (months)","Daily Milk (kg)","Lactation Milk (kg)","Calving Interval (days)"),
  N = c(nrow(afc_data), nrow(daily_milk_data), nrow(tly_data), nrow(ci_data)),
  Mean = c(mean(afc_data$afc_months), mean(daily_milk_data$daily_yield),
           mean(tly_data$total_lactation_yield), mean(ci_data$calving_interval_days)),
  SD = c(sd(afc_data$afc_months), sd(daily_milk_data$daily_yield),
         sd(tly_data$total_lactation_yield), sd(ci_data$calving_interval_days)),
  Min = c(min(afc_data$afc_months), min(daily_milk_data$daily_yield),
          min(tly_data$total_lactation_yield), min(ci_data$calving_interval_days)),
  Q1 = c(quantile(afc_data$afc_months,.25), quantile(daily_milk_data$daily_yield,.25),
         quantile(tly_data$total_lactation_yield,.25), quantile(ci_data$calving_interval_days,.25)),
  Median = c(median(afc_data$afc_months), median(daily_milk_data$daily_yield),
             median(tly_data$total_lactation_yield), median(ci_data$calving_interval_days)),
  Q3 = c(quantile(afc_data$afc_months,.75), quantile(daily_milk_data$daily_yield,.75),
         quantile(tly_data$total_lactation_yield,.75), quantile(ci_data$calving_interval_days,.75)),
  Max = c(max(afc_data$afc_months), max(daily_milk_data$daily_yield),
          max(tly_data$total_lactation_yield), max(ci_data$calving_interval_days)),
  CV = c(sd(afc_data$afc_months)/mean(afc_data$afc_months)*100,
         sd(daily_milk_data$daily_yield)/mean(daily_milk_data$daily_yield)*100,
         sd(tly_data$total_lactation_yield)/mean(tly_data$total_lactation_yield)*100,
         sd(ci_data$calving_interval_days)/mean(ci_data$calving_interval_days)*100),
  check.names = FALSE)

summary_stats %>% mutate(across(where(is.numeric), ~round(., 2))) %>%
  kable(caption = "Descriptive statistics (after QC filtering)",
        format.args = list(big.mark = ","),
        col.names = c("Trait","N","Mean","SD","Min","Q1","Median","Q3","Max","CV (%)")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white")
Descriptive statistics (after QC filtering)
Trait N Mean SD Min Q1 Median Q3 Max CV (%)
AFC (months) 27,669 39.33 11.88 18.00 30.00 36.0 46.0 72.0 30.22
Daily Milk (kg) 584,617 9.21 7.07 0.51 1.77 9.2 14.2 50.0 76.81
Lactation Milk (kg) 9,573 977.77 1,705.30 100.00 186.00 363.5 730.6 19,583.5 174.41
Calving Interval (days) 44,250 459.54 130.92 280.00 365.00 416.0 516.0 900.0 28.49

6.2 Trait Distributions

p1 <- ggplot(afc_data, aes(x = afc_months)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40, fill = trait_colors["AFC"], color = "white", alpha = 0.8) +
  geom_density(linewidth = 1, color = "darkred") +
  geom_vline(xintercept = mean(afc_data$afc_months), linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(afc_data$afc_months)+2, y = Inf,
           label = paste0("Mean = ", round(mean(afc_data$afc_months),1), " mo"),
           vjust = 2, hjust = 0, fontface = "bold", size = 4) +
  labs(title = "Age at First Calving (AFC)", x = "AFC (months)", y = "Density") +
  scale_x_continuous(breaks = seq(18, 72, 6))

p2 <- ggplot(daily_milk_data, aes(x = daily_yield)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = trait_colors["Daily Milk"], color = "white", alpha = 0.8) +
  geom_density(linewidth = 1, color = "darkblue") +
  geom_vline(xintercept = mean(daily_milk_data$daily_yield), linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(daily_milk_data$daily_yield)+1, y = Inf,
           label = paste0("Mean = ", round(mean(daily_milk_data$daily_yield),1), " kg"),
           vjust = 2, hjust = 0, fontface = "bold", size = 4) +
  labs(title = "Daily Milk Yield", x = "Daily Milk (kg/day)", y = "Density")

p3 <- ggplot(tly_data, aes(x = total_lactation_yield)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = trait_colors["Lactation Milk"], color = "white", alpha = 0.8) +
  geom_density(linewidth = 1, color = "darkgreen") +
  geom_vline(xintercept = mean(tly_data$total_lactation_yield), linetype = "dashed", linewidth = 1) +
  annotate("text", x = mean(tly_data$total_lactation_yield)+200, y = Inf,
           label = paste0("Mean = ", round(mean(tly_data$total_lactation_yield),0), " kg"),
           vjust = 2, hjust = 0, fontface = "bold", size = 4) +
  labs(title = "Total Lactation Yield (TLY)", x = "Lactation Milk (kg)", y = "Density") +
  scale_x_continuous(labels = comma)

p4 <- ggplot(ci_data, aes(x = calving_interval_days)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50, fill = trait_colors["Calving Interval"], color = "white", alpha = 0.8) +
  geom_density(linewidth = 1, color = "purple4") +
  geom_vline(xintercept = mean(ci_data$calving_interval_days), linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = 365, linetype = "dotted", color = "red", linewidth = 0.8) +
  annotate("text", x = mean(ci_data$calving_interval_days)+15, y = Inf,
           label = paste0("Mean = ", round(mean(ci_data$calving_interval_days),0), " d"),
           vjust = 2, hjust = 0, fontface = "bold", size = 4) +
  annotate("text", x = 370, y = Inf, label = "Optimal\n(365 d)", vjust = 2, hjust = 0,
           color = "red", size = 3.5, fontface = "italic") +
  labs(title = "Calving Interval (CI)", x = "Calving Interval (days)", y = "Density")

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Distribution of All Four Traits",
    subtitle = "Histograms with density curves; dashed = mean",
    theme = theme(plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
                  plot.subtitle = element_text(hjust = 0.5, color = "gray40", size = 14)))

6.3 Box Plots by Region

p1b <- ggplot(afc_data, aes(x = reorder(region, afc_months, FUN=median), y = afc_months)) +
  geom_boxplot(fill = trait_colors["AFC"], alpha = 0.7, outlier.size = 0.4) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "darkred") +
  labs(title = "AFC by Region", x = NULL, y = "AFC (months)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p2b <- ggplot(daily_milk_data, aes(x = reorder(region, daily_yield, FUN=median), y = daily_yield)) +
  geom_boxplot(fill = trait_colors["Daily Milk"], alpha = 0.7, outlier.size = 0.4) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "darkblue") +
  labs(title = "Daily Milk by Region", x = NULL, y = "Daily Milk (kg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p3b <- ggplot(tly_data, aes(x = reorder(region, total_lactation_yield, FUN=median), y = total_lactation_yield)) +
  geom_boxplot(fill = trait_colors["Lactation Milk"], alpha = 0.7, outlier.size = 0.4) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "darkgreen") +
  scale_y_continuous(labels = comma) +
  labs(title = "Lactation Milk by Region", x = NULL, y = "Lactation Milk (kg)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p4b <- ggplot(ci_data, aes(x = reorder(region, calving_interval_days, FUN=median), y = calving_interval_days)) +
  geom_boxplot(fill = trait_colors["Calving Interval"], alpha = 0.7, outlier.size = 0.4) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "purple4") +
  geom_hline(yintercept = 365, linetype = "dashed", color = "red") +
  labs(title = "Calving Interval by Region", x = NULL, y = "CI (days)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

(p1b + p2b) / (p3b + p4b) +
  plot_annotation(title = "Trait Distributions by Region",
    subtitle = "Boxes = middle 50%; diamond = mean; line = median",
    theme = theme(plot.title = element_text(face="bold", size=18, hjust=0.5),
                  plot.subtitle = element_text(hjust=0.5, color="gray40", size=14)))

6.4 Summary Table by Region

region_afc <- afc_data %>% group_by(region) %>%
  summarise(AFC_N=n(), AFC_Mean=round(mean(afc_months),1), AFC_SD=round(sd(afc_months),1), .groups="drop")
region_dmk <- daily_milk_data %>% group_by(region) %>%
  summarise(DMY_N=n(), DMY_Mean=round(mean(daily_yield),2), DMY_SD=round(sd(daily_yield),2), .groups="drop")
region_tly <- tly_data %>% group_by(region) %>%
  summarise(TLY_N=n(), TLY_Mean=round(mean(total_lactation_yield),0), TLY_SD=round(sd(total_lactation_yield),0), .groups="drop")
region_ci <- ci_data %>% group_by(region) %>%
  summarise(CI_N=n(), CI_Mean=round(mean(calving_interval_days),0), CI_SD=round(sd(calving_interval_days),0), .groups="drop")

region_afc %>% left_join(region_dmk, by="region") %>% left_join(region_tly, by="region") %>%
  left_join(region_ci, by="region") %>% arrange(region) %>%
  kable(caption = "Regional summary: mean (SD) for all four traits",
    col.names = c("Region","N","Mean","SD","N","Mean","SD","N","Mean","SD","N","Mean","SD"),
    format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white") %>%
  add_header_above(c(" "=1, "AFC (months)"=3, "Daily Milk (kg/d)"=3, "Lactation Milk (kg)"=3, "Calving Interval (days)"=3))
Regional summary: mean (SD) for all four traits
AFC (months)
Daily Milk (kg/d)
Lactation Milk (kg)
Calving Interval (days)
Region N Mean SD N Mean SD N Mean SD N Mean SD
Baringo 110 39.5 9.9 4,990 1.25 1.25 2 196 20 328 489 131
Bomet 24 46.1 14.0 374 1.44 0.45 NA NA NA 14 450 132
Bungoma 10 51.6 12.7 83 0.94 0.26 NA NA NA 7 464 115
Busia 14 51.1 10.3 65 9.43 3.24 3 132 26 1 364 NA
Elgeyo Marakwet 5 52.0 16.8 130 1.11 0.61 NA NA NA 4 433 51
Embu 108 45.2 13.6 2,247 1.34 0.88 1 105 NA 91 472 129
Kajiado 71 42.7 13.1 255 1.58 1.18 42 573 290 17 498 202
Kakamega 238 39.1 12.7 1,211 2.04 2.44 7 129 16 441 438 122
Kericho 5 37.6 19.0 40 8.53 8.80 1 317 NA NA NA NA
Kiambu 1,676 37.7 11.0 17,553 2.80 4.82 75 263 105 2,306 453 134
Kilifi 622 32.0 9.8 234,482 14.19 5.27 1,056 3,353 2,721 734 454 111
Kirinyaga 18 35.7 13.3 287 1.50 0.53 NA NA NA 4 476 104
Kisii 57 38.4 12.1 506 5.32 4.72 4 165 61 12 410 65
Kisumu 202 43.3 14.1 4,646 1.39 1.29 5 130 25 244 475 151
Kwale 6 30.2 2.6 24 2.05 0.32 NA NA NA NA NA NA
Laikipia 1,601 31.8 7.8 20,265 3.74 5.13 339 151 32 3,019 449 120
Machakos 44 35.0 8.2 1,676 2.51 0.83 1 112 NA 82 434 107
Makueni 3,144 34.6 5.0 94,386 11.32 3.90 2,385 459 161 10,106 417 90
Meru 209 40.6 11.9 2,173 2.14 4.00 10 185 107 356 494 141
Migori 1 33.0 NA 388 1.41 0.92 NA NA NA 30 519 153
Mombasa 62 28.9 7.0 NA NA NA NA NA NA 185 428 122
Muranga 41 38.5 14.2 613 1.58 0.75 NA NA NA 30 442 142
Nairobi 1,513 34.9 12.1 20,054 3.93 5.38 263 197 67 2,761 451 119
Nakuru 11,474 40.8 12.2 89,291 4.40 6.44 3,885 1,096 1,752 17,606 482 145
Nandi County 804 50.7 12.6 14,582 6.65 3.09 566 210 216 104 499 141
Narok 289 41.0 8.4 4,416 4.11 2.05 156 202 66 64 623 146
Nyamira 9 29.2 2.5 171 1.29 0.54 NA NA NA 8 474 87
Nyandarua 530 40.3 13.0 5,266 2.63 3.14 42 313 376 327 472 146
Nyeri 1,102 35.6 10.4 21,249 5.30 7.22 372 222 72 1,724 473 133
Siaya 40 46.8 15.0 781 1.69 0.65 NA NA NA 47 528 112
Taita Taveta 1 29.0 NA 6 0.77 0.12 NA NA NA NA NA NA
Tharaka Nithi 1 26.0 NA 11 1.21 0.09 NA NA NA NA NA NA
Trans Nzoia 2,407 44.4 11.7 20,108 3.69 5.09 321 191 124 2,059 469 137
Uasin Gishu 1,218 42.6 12.6 22,052 1.76 2.77 37 247 117 1,519 480 139
Vihiga 13 31.8 11.6 81 1.04 0.34 NA NA NA 2 390 16

6.5 Summary by Breed

bind_rows(
  afc_data %>% group_by(main_breed_name) %>% summarise(N=n(), Mean=round(mean(afc_months),1), SD=round(sd(afc_months),1), .groups="drop") %>% mutate(Trait="AFC (months)"),
  daily_milk_data %>% group_by(main_breed_name) %>% summarise(N=n(), Mean=round(mean(daily_yield),2), SD=round(sd(daily_yield),2), .groups="drop") %>% mutate(Trait="Daily Milk (kg)"),
  tly_data %>% group_by(main_breed_name) %>% summarise(N=n(), Mean=round(mean(total_lactation_yield),0), SD=round(sd(total_lactation_yield),0), .groups="drop") %>% mutate(Trait="Lactation Milk (kg)"),
  ci_data %>% group_by(main_breed_name) %>% summarise(N=n(), Mean=round(mean(calving_interval_days),0), SD=round(sd(calving_interval_days),0), .groups="drop") %>% mutate(Trait="CI (days)")
) %>% select(Trait, Breed=main_breed_name, N, Mean, SD) %>%
  kable(caption = "Mean (SD) by breed for each trait (unadjusted)", format.args = list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white")
Mean (SD) by breed for each trait (unadjusted)
Trait Breed N Mean SD
AFC (months) Ayrshire 5,067 43.60 11.60
AFC (months) Ayrshire crosses 87 36.80 11.00
AFC (months) Boran 101 38.70 9.10
AFC (months) Breed_0 147 37.30 8.90
AFC (months) Breed_12 5 44.00 17.40
AFC (months) Breed_17 1 26.00 NA
AFC (months) Breed_35 1 58.00 NA
AFC (months) Breed_38 1 46.00 NA
AFC (months) Breed_43 1 46.00 NA
AFC (months) Breed_72 1 52.00 NA
AFC (months) Breed_80 1 42.00 NA
AFC (months) Brown Swiss 147 44.20 13.90
AFC (months) FFFB 19 41.10 15.10
AFC (months) Fleckvieh 53 31.80 8.10
AFC (months) Gir 197 31.10 7.50
AFC (months) Girolando 22 29.50 5.30
AFC (months) Guernsey 573 41.50 12.60
AFC (months) Guernsey crosses 8 37.00 15.20
AFC (months) Hereford 2 27.50 0.70
AFC (months) Holstein-Friesian crosses 1,253 35.80 7.00
AFC (months) Holstein Friesian 15,541 37.10 11.00
AFC (months) Jersey 1,130 31.40 10.10
AFC (months) Jersey crosses 70 30.10 6.20
AFC (months) Local Zebu 7 49.00 19.20
AFC (months) Magic 12 37.80 13.50
AFC (months) Others 2 61.50 6.40
AFC (months) Red Poll 23 47.00 12.70
AFC (months) Sahiwal 2,874 49.30 11.40
AFC (months) Sahiwal crosses 226 38.40 8.30
AFC (months) Scandinavian Red 50 26.80 4.50
AFC (months) Unknown 43 40.10 14.00
AFC (months) Zebu 4 43.20 18.60
Daily Milk (kg) Ayrshire 56,017 3.27 4.36
Daily Milk (kg) Ayrshire crosses 14,858 14.33 4.70
Daily Milk (kg) Boran 3,176 10.49 4.33
Daily Milk (kg) Breed_0 2,067 2.16 3.66
Daily Milk (kg) Brown Swiss 18,483 13.14 6.58
Daily Milk (kg) FFFB 245 4.36 4.12
Daily Milk (kg) FFFF 245 4.36 4.12
Daily Milk (kg) Fleckvieh 15,816 12.15 4.51
Daily Milk (kg) Gir 75,910 14.22 5.24
Daily Milk (kg) Girolando 10,343 12.68 4.36
Daily Milk (kg) Guernsey 6,140 1.99 2.74
Daily Milk (kg) Guernsey crosses 215 15.47 5.70
Daily Milk (kg) Hereford 8 1.40 0.32
Daily Milk (kg) Holstein-Friesian crosses 67,360 13.24 5.42
Daily Milk (kg) Holstein Friesian 213,921 6.11 6.59
Daily Milk (kg) Jersey 13,392 5.46 6.13
Daily Milk (kg) Jersey crosses 24,695 14.77 4.95
Daily Milk (kg) Local Zebu 119 3.33 1.94
Daily Milk (kg) Magic 342 4.87 6.69
Daily Milk (kg) Others 5 8.40 1.41
Daily Milk (kg) Red Poll 231 0.79 0.18
Daily Milk (kg) Sahiwal 16,644 3.85 4.25
Daily Milk (kg) Sahiwal crosses 23,655 13.31 5.21
Daily Milk (kg) Scandinavian Red 20,357 13.08 4.36
Daily Milk (kg) Unknown 369 3.90 3.01
Daily Milk (kg) Zebu 4 8.53 1.40
Lactation Milk (kg) Ayrshire 611 221.00 403.00
Lactation Milk (kg) Ayrshire crosses 97 3,127.00 2,480.00
Lactation Milk (kg) Boran 78 448.00 659.00
Lactation Milk (kg) Breed_0 9 179.00 75.00
Lactation Milk (kg) Breed_43 1 108.00 NA
Lactation Milk (kg) Breed_80 1 909.00 NA
Lactation Milk (kg) Brown Swiss 61 4,438.00 3,245.00
Lactation Milk (kg) FFFB 3 141.00 58.00
Lactation Milk (kg) FFFF 3 141.00 58.00
Lactation Milk (kg) Fleckvieh 69 2,848.00 1,909.00
Lactation Milk (kg) Gir 372 3,069.00 2,758.00
Lactation Milk (kg) Girolando 39 3,932.00 3,317.00
Lactation Milk (kg) Guernsey 43 189.00 141.00
Lactation Milk (kg) Guernsey crosses 1 3,248.00 NA
Lactation Milk (kg) Holstein-Friesian crosses 1,264 800.00 1,324.00
Lactation Milk (kg) Holstein Friesian 3,869 566.00 1,355.00
Lactation Milk (kg) Jersey 286 215.00 162.00
Lactation Milk (kg) Jersey crosses 150 2,505.00 2,298.00
Lactation Milk (kg) Magic 6 201.00 64.00
Lactation Milk (kg) Sahiwal 1,848 678.00 696.00
Lactation Milk (kg) Sahiwal crosses 633 2,951.00 2,109.00
Lactation Milk (kg) Scandinavian Red 89 3,391.00 3,304.00
Lactation Milk (kg) Unknown 39 1,167.00 1,151.00
Lactation Milk (kg) Zebu 1 205.00 NA
CI (days) Ayrshire 5,680 491.00 148.00
CI (days) Ayrshire crosses 71 472.00 129.00
CI (days) Boran 163 425.00 86.00
CI (days) Breed_0 155 464.00 124.00
CI (days) Brown Swiss 149 481.00 126.00
CI (days) FFFB 23 532.00 142.00
CI (days) FFFF 23 532.00 142.00
CI (days) Fleckvieh 29 414.00 76.00
CI (days) Gir 204 459.00 108.00
CI (days) Girolando 15 496.00 112.00
CI (days) Guernsey 814 460.00 137.00
CI (days) Guernsey crosses 3 392.00 143.00
CI (days) Holstein-Friesian crosses 3,101 416.00 91.00
CI (days) Holstein Friesian 24,672 448.00 123.00
CI (days) Jersey 2,545 442.00 115.00
CI (days) Jersey crosses 122 419.00 92.00
CI (days) Local Zebu 18 461.00 134.00
CI (days) Magic 38 442.00 114.00
CI (days) Others 8 728.00 10.00
CI (days) Red Poll 17 540.00 189.00
CI (days) Sahiwal 5,123 518.00 155.00
CI (days) Sahiwal crosses 442 475.00 130.00
CI (days) Scandinavian Red 38 502.00 120.00
CI (days) Unknown 797 426.00 93.00
Unadjusted Means

Breed groups are confounded with region and management. Fair comparisons require the fixed-effects models below.

6.6 Lactation Curve (DIM Effect)

dim_summary <- daily_milk_data %>%
  mutate(dim_class_num = as.numeric(as.character(dim_class))) %>%
  group_by(dim_class_num) %>%
  summarise(N = n(), Mean = mean(daily_yield), SE = sd(daily_yield)/sqrt(n()), .groups = "drop") %>%
  mutate(DIM_midpoint = (dim_class_num - 0.5)*30, Lower = Mean - 1.96*SE, Upper = Mean + 1.96*SE)

ggplot(dim_summary, aes(x = DIM_midpoint, y = Mean)) +
  geom_ribbon(aes(ymin = Lower, ymax = Upper), fill = trait_colors["Daily Milk"], alpha = 0.25) +
  geom_line(linewidth = 1.3, color = trait_colors["Daily Milk"]) +
  geom_point(size = 3, color = trait_colors["Daily Milk"]) +
  geom_vline(xintercept = 45, linetype = "dashed", color = "red", linewidth = 0.8) +
  annotate("text", x = 55, y = max(dim_summary$Mean)*0.98, label = "Peak yield (~DIM 45)",
           color = "red", fontface = "italic", hjust = 0, size = 4.5) +
  labs(title = "Lactation Curve: Average Daily Milk Yield by Days in Milk",
       subtitle = "Shaded = 95% CI", x = "Days in Milk (DIM)", y = "Mean Daily Milk Yield (kg)") +
  scale_x_continuous(breaks = seq(0, 500, 50)) + theme(panel.grid.minor = element_blank())

dim_summary %>%
  mutate(`DIM Class` = dim_class_num,
         `DIM Range` = paste0((dim_class_num-1)*30+1, "\u2013", dim_class_num*30),
         `Mean yield (kg)` = round(Mean, 2),
         `95% CI` = paste0("[", round(Lower,2), ", ", round(Upper,2), "]")) %>%
  select(`DIM Class`, `DIM Range`, N, `Mean yield (kg)`, `95% CI`) %>%
  kable(caption = "Mean daily milk by DIM class", format.args = list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Mean daily milk by DIM class
DIM Class DIM Range N Mean yield (kg) 95% CI
1 1–30 56,973 9.43 [9.36, 9.49]
2 31–60 58,893 11.26 [11.2, 11.33]
3 61–90 54,743 10.78 [10.72, 10.85]
4 91–120 52,668 10.32 [10.25, 10.38]
5 121–150 50,189 9.82 [9.76, 9.88]
6 151–180 50,183 9.50 [9.44, 9.56]
7 181–210 45,369 9.07 [9.01, 9.13]
8 211–240 42,364 8.61 [8.55, 8.67]
9 241–270 40,582 8.28 [8.23, 8.34]
10 271–300 33,811 7.90 [7.84, 7.96]
11 301–330 26,118 7.47 [7.4, 7.54]
12 331–360 19,977 7.31 [7.23, 7.39]
13 361–390 16,074 7.19 [7.09, 7.28]
14 391–420 12,153 6.93 [6.82, 7.03]
15 421–450 10,061 6.63 [6.51, 6.74]
16 451–480 9,046 6.48 [6.36, 6.6]
17 481–510 5,413 6.25 [6.1, 6.41]

6.7 Parity Effect

parity_dm  <- daily_milk_data %>% group_by(parity_class) %>% summarise(Mean=mean(daily_yield), SE=sd(daily_yield)/sqrt(n()), .groups="drop")
parity_tly <- tly_data %>% group_by(parity_class) %>% summarise(Mean=mean(total_lactation_yield), SE=sd(total_lactation_yield)/sqrt(n()), .groups="drop")
parity_ci  <- ci_data %>% group_by(parity_class) %>% summarise(Mean=mean(calving_interval_days), SE=sd(calving_interval_days)/sqrt(n()), .groups="drop")

mkp <- function(df, fc, yl, cy=FALSE) {
  ggplot(df, aes(x=parity_class, y=Mean)) + geom_bar(stat="identity", fill=fc, alpha=0.85) +
    geom_errorbar(aes(ymin=Mean-1.96*SE, ymax=Mean+1.96*SE), width=0.2) +
    geom_text(aes(label=round(Mean, ifelse(yl=="Mean CI (days)",0,1))), vjust=-1.6, fontface="bold", size=4) +
    labs(x="Parity", y=yl) + scale_y_continuous(expand=expansion(mult=c(0,0.18)), labels=if(cy) comma else waiver())
}
p_dm  <- mkp(parity_dm,  trait_colors["Daily Milk"],     "Mean daily milk (kg)") + labs(title="Daily Milk by Parity")
p_tly <- mkp(parity_tly, trait_colors["Lactation Milk"], "Mean lactation milk (kg)", TRUE) + labs(title="Lactation Milk by Parity")
p_ci  <- mkp(parity_ci,  trait_colors["Calving Interval"],"Mean CI (days)") + labs(title="CI by Parity")

(p_dm | p_tly | p_ci) +
  plot_annotation(title = "Effect of Parity on Trait Means (Raw, Unadjusted)",
    subtitle = "Error bars = 95% CI; Parity 4 = '4 and above'",
    theme = theme(plot.title = element_text(face="bold", size=17, hjust=0.5),
                  plot.subtitle = element_text(hjust=0.5, color="gray40", size=13)))


7 Fixed Effects Analysis

Purpose

Quantify and remove systematic environmental effects before genetic evaluation. The residual variation is the raw material for BLUPF90 breeding value estimation.

7.1 Model Specification

\[y_{ijklm} = \mu + \text{Region}_i + \text{Breed}_j + \text{Parity}_k + \text{DIM}_l + e_{ijklm}\]

Note: Region is used instead of Farm for interpretability. Farm will be included in the BLUPF90 evaluation.

7.2 AFC Age at First Calving

Model: AFC ~ Region + Breed

afc_anova_df <- as.data.frame(anova(afc_model)) %>%
  tibble::rownames_to_column("Effect") %>%
  mutate(`% Variance` = round(`Sum Sq`/sum(`Sum Sq`)*100, 2),
         `Pr(>F)` = ifelse(is.na(`Pr(>F)`), NA_character_,
                           ifelse(`Pr(>F)` < 0.001, "< 0.001", as.character(round(`Pr(>F)`, 4)))))
afc_anova_df %>%
  kable(caption = "ANOVA \u2014 AFC ~ Region + Breed", digits = 2, format.args = list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE, background = "#2b6cb0", color = "white")
ANOVA — AFC ~ Region + Breed
Effect Df Sum Sq Mean Sq F value Pr(>F) % Variance
region 34 473,609.0 13,929.68 126.13 < 0.001 12.12
main_breed_name 31 385,667.1 12,440.87 112.65 < 0.001 9.87
Residuals 27,603 3,048,357.6 110.44 NA NA 78.01
22%
R² — Variance Explained
21.8%
Adjusted R²
±10.51 mo
Residual Std. Error
27,669
Records in Model

7.2.1 AFC — Breed Effects

afc_lsmeans <- afc_data %>% mutate(predicted = predict(afc_model)) %>%
  group_by(main_breed_name) %>%
  summarise(N=n(), Raw_Mean=round(mean(afc_months),2), LS_Mean=round(mean(predicted),2), .groups="drop") %>%
  arrange(LS_Mean)

ggplot(afc_lsmeans, aes(x = reorder(main_breed_name, LS_Mean), y = LS_Mean)) +
  geom_bar(stat="identity", fill=trait_colors["AFC"], alpha=0.8) +
  geom_point(aes(y=Raw_Mean), color="darkred", size=3, shape=17) +
  geom_text(aes(label=paste0(LS_Mean," mo")), vjust=-0.5, fontface="bold", size=3.5) +
  labs(title="Breed Effect on AFC", subtitle="Bars = adjusted; triangles = raw",
       x="Breed", y="AFC (months)") +
  theme(axis.text.x = element_text(angle=45, hjust=1)) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12)))

afc_lsmeans %>% kable(caption="AFC by breed: raw vs. adjusted",
  col.names=c("Breed","N","Raw Mean (mo)","Adjusted Mean (mo)"), format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
AFC by breed: raw vs. adjusted
Breed N Raw Mean (mo) Adjusted Mean (mo)
Breed_17 1 26.00 26.00
Scandinavian Red 50 26.84 26.84
Hereford 2 27.50 27.50
Girolando 22 29.45 29.45
Jersey crosses 70 30.11 30.11
Gir 197 31.11 31.11
Jersey 1,130 31.44 31.44
Fleckvieh 53 31.75 31.75
Holstein-Friesian crosses 1,253 35.79 35.79
Ayrshire crosses 87 36.77 36.77
Guernsey crosses 8 37.00 37.00
Holstein Friesian 15,541 37.08 37.08
Breed_0 147 37.28 37.28
Magic 12 37.75 37.75
Sahiwal crosses 226 38.44 38.44
Boran 101 38.67 38.67
Unknown 43 40.09 40.09
FFFB 19 41.05 41.05
Guernsey 573 41.45 41.45
Breed_80 1 42.00 42.00
Zebu 4 43.25 43.25
Ayrshire 5,067 43.59 43.59
Breed_12 5 44.00 44.00
Brown Swiss 147 44.18 44.18
Breed_38 1 46.00 46.00
Breed_43 1 46.00 46.00
Red Poll 23 47.00 47.00
Local Zebu 7 49.00 49.00
Sahiwal 2,874 49.28 49.28
Breed_72 1 52.00 52.00
Breed_35 1 58.00 58.00
Others 2 61.50 61.50

7.3 DMY Daily Milk Yield

Model: Daily_Milk ~ Region + Breed + Parity + DIM_class

dm_anova_df <- as.data.frame(anova(dm_model)) %>%
  tibble::rownames_to_column("Effect") %>%
  mutate(`% Variance` = round(`Sum Sq`/sum(`Sum Sq`)*100, 2),
         `Pr(>F)` = ifelse(is.na(`Pr(>F)`), NA_character_,
                           ifelse(`Pr(>F)` < 0.001, "< 0.001", as.character(round(`Pr(>F)`, 4)))))
dm_anova_df %>%
  kable(caption = "ANOVA \u2014 Daily Milk ~ Region + Breed + Parity + DIM", digits=2, format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white") %>%
  row_spec(which(dm_anova_df$`% Variance` == max(dm_anova_df$`% Variance`[dm_anova_df$Effect != "Residuals"], na.rm=TRUE)),
           bold=TRUE, background="#FFF3CD")
ANOVA — Daily Milk ~ Region + Breed + Parity + DIM
Effect Df Sum Sq Mean Sq F value Pr(>F) % Variance
region 34 13,984,781.8 411,317.11 17,832.35 < 0.001 47.82
main_breed_name 25 472,462.1 18,898.49 819.33 < 0.001 1.62
parity_class 3 339,307.2 113,102.41 4,903.47 < 0.001 1.16
dim_class 16 967,813.4 60,488.34 2,622.43 < 0.001 3.31
Residuals 584,538 13,482,827.3 23.07 NA NA 46.10
53.9%
R² — Variance Explained
53.9%
Adjusted R²
±4.8 kg
Residual Std. Error
584,617
Records in Model

7.3.1 Daily Milk — DIM and Breed Effects

dm_dim_means <- daily_milk_data %>%
  mutate(predicted = predict(dm_model), dim_class_n = as.numeric(as.character(dim_class))) %>%
  group_by(dim_class_n) %>%
  summarise(N=n(), Raw_Mean=mean(daily_yield), LS_Mean=mean(predicted), .groups="drop")

p_dim <- ggplot(dm_dim_means, aes(x = dim_class_n*30-15)) +
  geom_ribbon(aes(ymin=LS_Mean*0.95, ymax=LS_Mean*1.05), fill=trait_colors["Daily Milk"], alpha=0.15) +
  geom_line(aes(y=LS_Mean, color="Adjusted"), linewidth=1.3) +
  geom_point(aes(y=LS_Mean, color="Adjusted"), size=3) +
  geom_line(aes(y=Raw_Mean, color="Raw"), linewidth=1, linetype="dashed") +
  geom_point(aes(y=Raw_Mean, color="Raw"), size=2) +
  scale_color_manual(values=c("Adjusted"=trait_colors["Daily Milk"], "Raw"="gray50")) +
  labs(title="Lactation Curve: Adjusted vs Raw", x="Days in Milk", y="Daily Milk (kg)", color=NULL) +
  scale_x_continuous(breaks=seq(0,510,60))

dm_breed_means <- daily_milk_data %>% mutate(predicted = predict(dm_model)) %>%
  group_by(main_breed_name) %>%
  summarise(N=n(), Raw_Mean=mean(daily_yield), LS_Mean=mean(predicted), .groups="drop")

p_breed <- ggplot(dm_breed_means, aes(x=reorder(main_breed_name, LS_Mean), y=LS_Mean)) +
  geom_bar(stat="identity", fill=trait_colors["Daily Milk"], alpha=0.8) +
  geom_point(aes(y=Raw_Mean), color="darkblue", size=3, shape=17) +
  geom_text(aes(label=round(LS_Mean,1)), vjust=-0.5, fontface="bold", size=3.5) +
  labs(title="Breed Effect on Daily Milk", x="Breed", y="Daily Milk (kg)") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_y_continuous(expand=expansion(mult=c(0,0.12)))

p_dim | p_breed

7.4 TLY Total Lactation Yield

Model: Lactation_Milk ~ Region + Breed + Parity

tly_anova_df <- as.data.frame(anova(tly_model)) %>%
  tibble::rownames_to_column("Effect") %>%
  mutate(`% Variance` = round(`Sum Sq`/sum(`Sum Sq`)*100, 2),
         `Pr(>F)` = ifelse(is.na(`Pr(>F)`), NA_character_,
                           ifelse(`Pr(>F)` < 0.001, "< 0.001", as.character(round(`Pr(>F)`, 4)))))
tly_anova_df %>%
  kable(caption = "ANOVA \u2014 TLY ~ Region + Breed + Parity", digits=2, format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white")
ANOVA — TLY ~ Region + Breed + Parity
Effect Df Sum Sq Mean Sq F value Pr(>F) % Variance
region 21 7,991,266,686 380,536,509 204.38 < 0.001 28.71
main_breed_name 23 2,000,132,129 86,962,266 46.71 < 0.001 7.19
parity_class 3 109,892,326 36,630,775 19.67 < 0.001 0.39
Residuals 9,525 17,734,451,280 1,861,885 NA NA 63.71
36.3%
R² — Variance Explained
36%
Adjusted R²
±1365 kg
Residual Std. Error
9,573
Records in Model

7.4.1 Parity and Breed Effects on TLY

tly_par_means <- tly_data %>% mutate(predicted = predict(tly_model)) %>%
  group_by(parity_class) %>%
  summarise(N=n(), Raw_Mean=mean(total_lactation_yield), LS_Mean=mean(predicted), .groups="drop")

ggplot(tly_par_means, aes(x=parity_class)) +
  geom_bar(aes(y=LS_Mean, fill="Adjusted"), stat="identity", alpha=0.85) +
  geom_point(aes(y=Raw_Mean, color="Raw"), size=4, shape=17) +
  geom_text(aes(y=LS_Mean, label=paste0(round(LS_Mean,0)," kg")), vjust=-0.6, fontface="bold", size=4) +
  scale_fill_manual(values=c("Adjusted"=trait_colors["Lactation Milk"])) +
  scale_color_manual(values=c("Raw"="darkgreen")) +
  labs(title="Parity Effect on TLY", x="Parity", y="Lactation Milk (kg)", fill=NULL, color=NULL) +
  scale_y_continuous(labels=comma, expand=expansion(mult=c(0,0.15)))

tly_breed_means <- tly_data %>% mutate(predicted = predict(tly_model)) %>%
  group_by(main_breed_name) %>%
  summarise(N=n(), Raw_Mean=mean(total_lactation_yield), LS_Mean=mean(predicted), .groups="drop") %>%
  arrange(LS_Mean)

ggplot(tly_breed_means, aes(x=reorder(main_breed_name, LS_Mean), y=LS_Mean)) +
  geom_bar(stat="identity", fill=trait_colors["Lactation Milk"], alpha=0.85) +
  geom_point(aes(y=Raw_Mean), color="darkgreen", size=3, shape=17) +
  geom_text(aes(label=paste0(format(round(LS_Mean,0), big.mark=",")," kg")), vjust=-0.5, fontface="bold", size=3.5) +
  labs(title="Breed Effect on TLY", x="Breed", y="Lactation Milk (kg)") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_y_continuous(labels=comma, expand=expansion(mult=c(0,0.14)))

tly_breed_means %>% mutate(across(c(Raw_Mean,LS_Mean), ~round(.,0))) %>%
  kable(caption="TLY by breed: raw vs. adjusted", col.names=c("Breed","N","Raw (kg)","Adjusted (kg)"),
        format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white")
TLY by breed: raw vs. adjusted
Breed N Raw (kg) Adjusted (kg)
Breed_43 1 108 108
FFFF 3 141 141
FFFB 3 141 141
Breed_0 9 179 179
Guernsey 43 189 189
Magic 6 201 201
Zebu 1 205 205
Jersey 286 215 215
Ayrshire 611 221 221
Boran 78 448 448
Holstein Friesian 3,869 566 566
Sahiwal 1,848 678 678
Holstein-Friesian crosses 1,264 800 800
Breed_80 1 909 909
Unknown 39 1,167 1,167
Jersey crosses 150 2,505 2,505
Fleckvieh 69 2,848 2,848
Sahiwal crosses 633 2,951 2,951
Gir 372 3,069 3,069
Ayrshire crosses 97 3,127 3,127
Guernsey crosses 1 3,248 3,248
Scandinavian Red 89 3,391 3,391
Girolando 39 3,932 3,932
Brown Swiss 61 4,438 4,438

7.5 CI Calving Interval

Model: CI ~ Region + Breed + Parity (parity 2+ only)

ci_anova_df <- as.data.frame(anova(ci_model)) %>%
  tibble::rownames_to_column("Effect") %>%
  mutate(`% Variance` = round(`Sum Sq`/sum(`Sum Sq`)*100, 2),
         `Pr(>F)` = ifelse(is.na(`Pr(>F)`), NA_character_,
                           ifelse(`Pr(>F)` < 0.001, "< 0.001", as.character(round(`Pr(>F)`, 4)))))
ci_anova_df %>%
  kable(caption = "ANOVA \u2014 CI ~ Region + Breed + Parity", digits=2, format.args=list(big.mark=",")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white")
ANOVA — CI ~ Region + Breed + Parity
Effect Df Sum Sq Mean Sq F value Pr(>F) % Variance
region 31 32,532,101 1,049,422.62 65.28 < 0.001 4.29
main_breed_name 23 13,856,888 602,473.38 37.48 < 0.001 1.83
parity_class 2 1,563,716 781,857.98 48.63 < 0.001 0.21
Residuals 44,193 710,458,036 16,076.26 NA NA 93.68
6.3%
R² — Variance Explained
6.2%
Adjusted R²
±127 days
Residual Std. Error
44,250
Records in Model
ci_breed_means <- ci_data %>% mutate(predicted = predict(ci_model)) %>%
  group_by(main_breed_name) %>%
  summarise(N=n(), Raw_Mean=mean(calving_interval_days), LS_Mean=mean(predicted), .groups="drop")

ggplot(ci_breed_means, aes(x=reorder(main_breed_name, LS_Mean), y=LS_Mean)) +
  geom_bar(stat="identity", fill=trait_colors["Calving Interval"], alpha=0.85) +
  geom_point(aes(y=Raw_Mean), color="purple4", size=3, shape=17) +
  geom_text(aes(label=round(LS_Mean,0)), vjust=-0.5, fontface="bold", size=4) +
  geom_hline(yintercept=365, linetype="dashed", color="red", linewidth=0.8) +
  annotate("text", x=0.6, y=375, label="Economic optimum (365 d)", color="red", hjust=0, fontface="italic", size=4) +
  labs(title="Breed Effect on Calving Interval", x="Breed", y="CI (days)") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_y_continuous(expand=expansion(mult=c(0,0.12)))

7.6 R² Summary Across All Traits

r2_df <- data.frame(
  Trait = c("AFC","Daily Milk","Lactation Milk","Calving Interval"),
  Model = c("Region + Breed","Region + Breed + Parity + DIM","Region + Breed + Parity","Region + Breed + Parity"),
  R2    = c(afc_r2, dm_r2, tly_r2, ci_r2),
  AdjR2 = c(afc_adjr2, dm_adjr2, tly_adjr2, ci_adjr2))

ggplot(r2_df, aes(x=reorder(Trait,R2), y=R2, fill=Trait)) +
  geom_bar(stat="identity", alpha=0.85, show.legend=FALSE) +
  geom_text(aes(label=paste0(R2,"%")), hjust=-0.2, fontface="bold", size=5) +
  scale_fill_manual(values=trait_colors) +
  labs(title="Variance Explained by Fixed Effects", x=NULL, y="R\u00B2 (%)") +
  scale_y_continuous(expand=expansion(mult=c(0,0.20)), limits=c(0,100)) + coord_flip()

r2_df %>% kable(caption="R\u00B2 by trait", col.names=c("Trait","Model","R\u00B2 (%)","Adj R\u00B2 (%)")) %>%
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white")
R² by trait
Trait Model R² (%) Adj R² (%)
AFC Region + Breed 22.0 21.8
Daily Milk Region + Breed + Parity + DIM 53.9 53.9
Lactation Milk Region + Breed + Parity 36.3 36.0
Calving Interval Region + Breed + Parity 6.3 6.2

AFC 22%

Region and breed explain this share. Moderate genetic variance available.

DMY 53.9%

DIM class drives most of this. After adjustment, large genetic spread remains.

TLY 36.3%

Most TLY variation is genetic + random — high selection potential.

CI 6.3%

Very low R². CI dominated by random fertility events; h² ≈ 0.03–0.06.
Key Insight

The lower the R², the more unexplained variation remains — much of it genetic. All four traits carry sufficient genetic variance for BLUPF90 evaluation.


8 Comprehensive Summary

8.1 Full Summary Table

final_summary <- data.frame(
  Metric = c("Records (after QC)","Animals","Farms","Mean","Standard deviation","CV (%)",
             "Minimum (QC limit)","Maximum (QC limit)","R\u00B2 of fixed effects (%)"),
  AFC = c(afc_nrec, format(n_distinct(afc_data$animal_id),big.mark=","),
          format(n_distinct(afc_data$farm_id),big.mark=","),
          round(mean(afc_data$afc_months),1), round(sd(afc_data$afc_months),2),
          round(sd(afc_data$afc_months)/mean(afc_data$afc_months)*100,1),
          round(min(afc_data$afc_months),1), round(max(afc_data$afc_months),1), afc_r2),
  `Daily Milk` = c(dm_nrec, format(n_distinct(daily_milk_data$animal_id),big.mark=","),
          format(n_distinct(daily_milk_data$farm_id),big.mark=","),
          round(mean(daily_milk_data$daily_yield),2), round(sd(daily_milk_data$daily_yield),2),
          round(sd(daily_milk_data$daily_yield)/mean(daily_milk_data$daily_yield)*100,1),
          round(min(daily_milk_data$daily_yield),1), round(max(daily_milk_data$daily_yield),1), dm_r2),
  `Lactation Milk` = c(tly_nrec, format(n_distinct(tly_data$animal_id),big.mark=","),
          format(n_distinct(tly_data$farm_id),big.mark=","),
          round(mean(tly_data$total_lactation_yield),0), round(sd(tly_data$total_lactation_yield),0),
          round(sd(tly_data$total_lactation_yield)/mean(tly_data$total_lactation_yield)*100,1),
          round(min(tly_data$total_lactation_yield),0), round(max(tly_data$total_lactation_yield),0), tly_r2),
  `Calving Interval` = c(ci_nrec, format(n_distinct(ci_data$animal_id),big.mark=","),
          format(n_distinct(ci_data$farm_id),big.mark=","),
          round(mean(ci_data$calving_interval_days),0), round(sd(ci_data$calving_interval_days),0),
          round(sd(ci_data$calving_interval_days)/mean(ci_data$calving_interval_days)*100,1),
          round(min(ci_data$calving_interval_days),0), round(max(ci_data$calving_interval_days),0), ci_r2),
  check.names = FALSE)

final_summary %>%
  kable(caption = "Comprehensive summary", col.names = c("Metric","AFC (months)","Daily Milk (kg)","Lactation Milk (kg)","Calving Interval (days)")) %>%
  kable_styling(bootstrap_options = c("striped","hover"), full_width = TRUE) %>%
  row_spec(0, bold=TRUE, background="#2b6cb0", color="white") %>%
  row_spec(c(4,9), bold = TRUE) %>%
  pack_rows("Data Volume",1,3) %>% pack_rows("Distribution",4,8) %>% pack_rows("Model Fit",9,9)
Comprehensive summary
Metric AFC (months) Daily Milk (kg) Lactation Milk (kg) Calving Interval (days)
Data Volume
Records (after QC) 27,669 584,617 9,573 44,250
Animals 27,669 15,478 4,941 17,635
Farms 1,247 651 356 453
Distribution
Mean 39.3 9.21 978 460
Standard deviation 11.88 7.07 1705 131
CV (%) 30.2 76.8 174.4 28.5
Minimum (QC limit) 18 0.5 100 280
Maximum (QC limit) 72 50 19584 900
Model Fit
R² of fixed effects (%) 22 53.9 36.3 6.3

8.2 Key Trait Means

39.3
AFC (months)
9.2
Daily Milk (kg/d)
978
Lactation Milk (kg)
460
Calving Interval (days)

8.3 Key Conclusions

✅ Data Pipeline Complete

Three raw CSV files merged, harmonised, and QC-filtered into four analysis-ready trait datasets covering multiple regions, breeds, and parities.

✅ Fixed Effects Characterised

Region, Breed, Parity, and DIM effects quantified for all four traits. All effects are highly significant (P < 0.001). DIM dominates daily milk variation.

✅ Genetic Variance Confirmed

Low-to-moderate R² values confirm substantial genetic variance remains after environmental correction — all four traits are suitable for BLUPF90 evaluation.

➡️ Next Step: Genetic Parameters

Variance component estimation using BLUPF90 animal models, followed by EBV computation and bull/cow ranking for selection decisions.


9 References

  1. Henderson, C.R. (1984). Applications of Linear Models in Animal Breeding. University of Guelph.
  2. Misztal, I., et al. (2018). Manual for BLUPF90 family programs. University of Georgia.
  3. Mrode, R., et al. (2021). Genomic selection in developing countries. Frontiers in Genetics, 10, 694.
  4. Ojango, J.M.K., et al. (2019). Genetic evaluation of test-day milk yields from smallholder dairy systems. Journal of Dairy Science, 102(6).
  5. Wood, P.D.P. (1967). Algebraic model of the lactation curve in cattle. Nature, 216, 164–165.

10 Session Information

sessionInfo()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Kenya.utf8  LC_CTYPE=English_Kenya.utf8   
[3] LC_MONETARY=English_Kenya.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_Kenya.utf8    

time zone: Africa/Nairobi
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] broom_1.0.9       scales_1.4.0      kableExtra_1.4.0  knitr_1.50       
[5] patchwork_1.3.2   ggplot2_3.5.2     tidyr_1.3.1       dplyr_1.1.4      
[9] data.table_1.17.8

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.1     tidyselect_1.2.1  
 [5] xml2_1.4.0         stringr_1.5.1      jquerylib_0.1.4    textshaping_1.0.2 
 [9] systemfonts_1.2.3  yaml_2.3.10        fastmap_1.2.0      R6_2.6.1          
[13] labeling_0.4.3     generics_0.1.4     backports_1.5.0    tibble_3.3.0      
[17] svglite_2.2.1      bslib_0.9.0        pillar_1.11.0      RColorBrewer_1.1-3
[21] rlang_1.1.6        stringi_1.8.7      cachem_1.1.0       xfun_0.52         
[25] sass_0.4.10        viridisLite_0.4.2  cli_3.6.5          withr_3.0.2       
[29] magrittr_2.0.3     digest_0.6.37      grid_4.5.1         rstudioapi_0.17.1 
[33] lifecycle_1.0.4    vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0        
[37] farver_2.1.2       rmarkdown_2.29     purrr_1.1.0        tools_4.5.1       
[41] pkgconfig_2.0.3    htmltools_0.5.8.1