Happy Heffers

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

The Happy Heffers Farm

The Happy Heffers Farm is looking to find a new place to raise dairy cows for milking. Where should they go, and which cows should they buy? These questions, and probably some ice cream, are on the lips of their investors.

The Happy Heffers Farm needs cows, and to get the best ones, management has hired us to make some visual guides to help their folks along. As such, we’re going to need some data, starting with https://www.kaggle.com/datasets/shahhet2812/cattle-health-and-feeding-data. This dataset has two CSV files we will need.

df = pd.read_csv('global_cattle_milk_yield_prediction_dataset.csv')
df2 = pd.read_csv('global_cattle_disease_detection_dataset.csv')
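Before plotting, it may be worth confirming that the columns used later are actually present. The sketch below uses a hypothetical helper, `missing_columns`, and a tiny made-up frame instead of the real CSV; the column names listed are the ones the later cells read from `df`.

```python
import pandas as pd

# Columns the later cells read from the milk-yield file.
REQUIRED_MILK_COLUMNS = [
    "Region", "Breed", "Age_Months", "Previous_Week_Avg_Yield",
    "Milk_Yield_L", "Body_Condition_Score", "Milking_Interval_hrs",
]

def missing_columns(frame, required):
    """Return the required column names absent from frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]

# Made-up stand-in data for illustration only; not taken from the Kaggle files.
demo = pd.DataFrame({"Region": ["Africa"], "Breed": ["Breed_A"], "Age_Months": [48]})
print(missing_columns(demo, REQUIRED_MILK_COLUMNS))
```

In the notebook itself, the same call would be made with `df` in place of `demo`; an empty list means every column is available.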

Where are the Cows?

First things first, we should see which cows are the most popular in each region. We don’t know where Happy Heffers is going to be, so a few pie charts will give us the top 5 breeds in each region.

# I had to split the region pie charts because the loop broke the markup, so each of the 6 regions gets its own block.
subset = df[df['Region'] == 'Africa']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in Africa (by Count)")
plt.savefig("africa_pie.png", dpi=150, bbox_inches='tight')
plt.show()

subset = df[df['Region'] == 'South_America']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in South_America (by Count)")
plt.show()

subset = df[df['Region'] == 'Oceania']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in Oceania (by Count)")
plt.show()

subset = df[df['Region'] == 'Europe_NA']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in Europe_NA (by Count)")
plt.show()

subset = df[df['Region'] == 'South_Asia']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in South_Asia (by Count)")
plt.show()

subset = df[df['Region'] == 'Global']
breed_counts = subset['Breed'].value_counts().nlargest(5)

plt.figure(figsize=(6, 6))
plt.pie(
    breed_counts,
    labels=breed_counts.index,
    autopct=lambda pct: f"{int(round(pct/100.*breed_counts.sum()))}",
    startangle=90
)
plt.title("Top 5 Breeds in Global (by Count)")
plt.show()

Analysis

There are quite a few charts here, but it’s critical to know what region has what. These graphs show that the preferred cow varies from region to region. Wherever Happy Heffers goes, they can expect a different cow to be king… or queen, actually.
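The six pie-chart blocks repeat the same counting step. As a rough sketch, that step could be gathered into one hypothetical helper, `top_breeds_by_region`, which returns the counts for every region at once; the rows here are made up, and the plotting itself is left out.

```python
import pandas as pd

def top_breeds_by_region(frame, n=5):
    """Map each region to a Series of its n most common breeds (hypothetical helper)."""
    return {
        region: group["Breed"].value_counts().nlargest(n)
        for region, group in frame.groupby("Region")
    }

# Made-up rows for illustration; the notebook would pass df instead.
demo = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Oceania", "Oceania"],
    "Breed": ["Breed_A", "Breed_A", "Breed_B", "Breed_C", "Breed_C"],
})
for region, counts in top_breeds_by_region(demo, n=1).items():
    print(region, counts.index[0], int(counts.iloc[0]))
```

Each Series in the result has the same shape as `breed_counts` in the cells above, so it could be handed to `plt.pie` unchanged.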

The Best Cows

With that good starter information on which cows are where, let’s look at when a cow can be expected to produce the most milk. A simple bar chart showing the average milk yield by age in years will be useful for determining which age is the most productive.

avg_by_month = df.groupby('Age_Months', as_index=False)['Previous_Week_Avg_Yield'].mean()

# Convert months to years
avg_by_month['Age_Years'] = avg_by_month['Age_Months'] / 12

# Round to the nearest year, as monthly bars crowd the visual
avg_by_month['Age_Years'] = avg_by_month['Age_Years'].round(0).astype(int)
df_sort_age = avg_by_month.sort_values('Age_Years')

# Average the monthly means within each year for the bar chart
avg_yield_by_age = df_sort_age.groupby('Age_Years')['Previous_Week_Avg_Yield'].mean()

avg_yield_by_age.plot(kind='bar', color='blue')
plt.xlabel('Age (Years)')
plt.ylabel('Average Yield')
plt.title('Average Weekly Yield by Age')
plt.tight_layout()
plt.show()

Analysis

This bar chart shows that, on average, the highest production is between 4 and 8 years of age, with average production dropping after 8 years. But this is the average across all breeds.
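One thing to keep in mind is that the cell above averages the monthly means, so a month with few cows counts as much as a month with many. A small sketch with made-up numbers shows how that can differ from averaging the raw rows once per year of age; the column names follow `df`, and the floor-division binning is an assumption, not the notebook's method.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook's df.
demo = pd.DataFrame({
    "Age_Months": [24, 24, 24, 28],
    "Previous_Week_Avg_Yield": [10.0, 10.0, 10.0, 20.0],
})

# Approach in the cell above: average per month first, then average those monthly means.
monthly = demo.groupby("Age_Months", as_index=False)["Previous_Week_Avg_Yield"].mean()
monthly["Age_Years"] = (monthly["Age_Months"] / 12).round(0).astype(int)
mean_of_means = monthly.groupby("Age_Years")["Previous_Week_Avg_Yield"].mean()

# Alternative: bin each cow by completed years, then average the raw rows once.
demo["Age_Years"] = demo["Age_Months"] // 12
per_cow_mean = demo.groupby("Age_Years")["Previous_Week_Avg_Yield"].mean()

print(mean_of_means.loc[2], per_cow_mean.loc[2])  # 15.0 12.5
```

With evenly populated months the two results would be close, so this mostly matters at the sparse ends of the age range.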

The Udder Ultimate.

This information is not sufficient, though, as we don’t know what breed of cow we’re dealing with (remember, regions have different cows). Additionally, the average may be pulled down or up by different breeds and needs to be broken out. Happy Heffers is looking for high milk production, so let’s look at the top 5 breeds by milk production with a multiple-line graph.

# Age in completed years (this truncates, unlike the rounding used for the bar chart)
df['Age_Years'] = (df['Age_Months'] / 12).astype(int)
# Find the top 5 breeds by total milk yield
top_breeds = (
    df.groupby('Breed')['Milk_Yield_L']
      .sum()
      .nlargest(5)
      .index
)
#Filter the dataframe for those breeds only
df_top = df[df['Breed'].isin(top_breeds)]
#Compute the average milk yield per year per breed
avg_milk = (
    df_top.groupby(['Breed', 'Age_Years'])['Milk_Yield_L']
           .mean()
           .reset_index()
)

plt.figure(figsize=(10,5))
for breed in top_breeds:
    breed_data = avg_milk[avg_milk['Breed'] == breed]
    plt.plot(
        breed_data['Age_Years'],
        breed_data['Milk_Yield_L'],
        marker='o',
        label=breed
    )

plt.title('Average Milk Yield per Year for Top 5 Breeds')
plt.xlabel('Age (Years)')
plt.ylabel('Average Milk Yield (L)')
plt.legend(title='Breed')
plt.grid(True)
plt.tight_layout()
plt.show()

Analysis

Wow, Holstein-Friesian cows greatly outperform other breeds in milk production, making them ideal dairy cows. But the next 4 are no slouches either: in their prime years they produce close to double the average, and they stay above the average even into their later years. This gives us options beyond the highest producer, especially if Holstein-Friesians don’t work for our climate or are hard or costly to source.
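The top 5 here are picked by total yield (`.sum()`), which also rewards breeds that simply have more rows in the dataset. As a hedged cross-check, the ranking could be repeated with `.mean()`. The sketch below uses made-up rows to show how the two rankings can disagree.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the notebook's df.
demo = pd.DataFrame({
    "Breed": ["Breed_A"] * 4 + ["Breed_B"] * 2,
    "Milk_Yield_L": [10.0, 10.0, 10.0, 10.0, 18.0, 18.0],
})

# Total yield favours the breed with more rows (40 L vs 36 L).
by_total = demo.groupby("Breed")["Milk_Yield_L"].sum().nlargest(1).index[0]
# Mean yield favours the breed that produces more per record (10 L vs 18 L).
by_mean = demo.groupby("Breed")["Milk_Yield_L"].mean().nlargest(1).index[0]

print(by_total, by_mean)  # Breed_A Breed_B
```

If both rankings return the same five breeds on the real data, the choice of aggregate does not change the story.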

Healthy Heffers

Now that we know which breed of cow is most productive, we should see if and how the treatment of cows matters. It’s one thing for Happy Heffers to claim happy cows, but we must see that it is true! Let’s make a heatmap to see where health, milking, and milk production are best so we can make the best schedules for our cows.

pivot_df = df.pivot_table(
    index='Body_Condition_Score',
    columns='Milking_Interval_hrs',
    values='Milk_Yield_L',
    aggfunc='mean'  # average Milk Yield
)

plt.figure(figsize=(8,6))
sns.heatmap(
    pivot_df,
    cmap='viridis',          
    annot=True,            
    fmt=".2f"               
)
plt.title('Average Milk Yield (L)')
plt.xlabel('Milking Interval (hours)')
plt.ylabel('Body Condition Score')
plt.gca().invert_yaxis()
plt.show()

Analysis

The highest yield per milking comes from high-condition cows milked every 24 hours. But that is not all this shows: the milking intervals are 6, 8, 12, and 24 hours, that is 1/4, 1/3, 1/2, and one full 24-hour cycle. High-condition cows give nearly the same amount per milking every 12 hours as they do every 24. There is also a minimal loss in milk when milking every 8 hours, but a significant loss when milking every 6. What this graph shows is that, on average, higher-condition cows cope better with more frequent milking than cows with lower condition scores, meaning more milk.
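To make the "more milk" step concrete, a per-milking yield can be converted to a rough daily total by multiplying by the number of milkings in 24 hours. This sketch assumes `Milk_Yield_L` is a per-milking figure, as the analysis above treats it; the helper `daily_yield` and the litre values are made up for illustration and are not read from the heatmap.

```python
def daily_yield(per_milking_litres, interval_hours):
    """Litres per 24 hours, assuming every milking in the day yields the same amount."""
    milkings_per_day = 24 / interval_hours
    return per_milking_litres * milkings_per_day

# Made-up per-milking yields for one high-condition cow, keyed by interval in hours.
example = {24: 20.0, 12: 19.0, 8: 18.0, 6: 12.0}
for interval, litres in example.items():
    print(interval, daily_yield(litres, interval))
```

Under these made-up numbers the 8-hour schedule gives the largest daily total, because the small per-milking loss is outweighed by the extra milkings, while the 6-hour schedule falls back again.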

Keeping Cows Healthy

So how do we keep them healthy? We should look at the common vaccines that cows are given. These vaccines are critical, as any of these diseases would be the end of the cow and, if it spreads, the farm. The following chart shows four common vaccines: Anthrax, IBR, BVD, and Rabies. Since a cow can have any combination of the vaccines, the data will show the combinations that farmers may have already given their cows. There may be many reasons for a cow to be considered unhealthy in the dataset, and the vaccines do not cover ALL of those diseases. As such, anything not “Healthy” is grouped as “Not_Healthy”, even though some conditions may not be as impactful as the diseases mentioned previously. Let’s see if there are health benefits to the vaccines beyond immunization against the primary diseases.

vaccine_acronyms = {
    'Anthrax_Vaccine': 'A',
    'IBR_Vaccine': 'I',
    'BVD_Vaccine': 'B',
    'Rabies_Vaccine': 'R'
}
vaccine_cols = list(vaccine_acronyms.keys())

# Normalize Disease_Status -> Healthy / Not_Healthy.
df2['Health_Group'] = df2['Disease_Status'].apply(lambda x: 'Healthy' if x == 'Healthy' else 'Not_Healthy')

# Create acronym combination label
def combo_label(row):
    # Join the acronyms of every vaccine the cow has received, e.g. 'AIR'
    vaccines = [vaccine_acronyms[v] for v in vaccine_cols if row[v] == 1]
    return ''.join(vaccines) if vaccines else 'None'

df2['Vaccine_Combo'] = df2.apply(combo_label, axis=1)

# Count Healthy vs Not_Healthy per combo
combo_status_counts = df2.groupby(['Vaccine_Combo', 'Health_Group']).size().unstack(fill_value=0)

# Prepare data for plotting
inner_labels = combo_status_counts.index
inner_sizes = combo_status_counts.sum(axis=1)

outer_sizes = []
outer_colors = []
outer_label_text = []

base_colors = plt.cm.tab20.colors
status_colors = {'Healthy': '#8fd694', 'Not_Healthy': '#f28e8e'}

# Compute outer ring (with % per combo)
for combo in combo_status_counts.index:
    total = combo_status_counts.loc[combo].sum()
    for status in combo_status_counts.columns:
        count = combo_status_counts.loc[combo, status]
        if count > 0:
            pct = (count / total) * 100
            outer_sizes.append(count)
            outer_colors.append(status_colors[status])
            outer_label_text.append(f"{status}\n{pct:.1f}%")
            
#Donuts mmmm.
fig, ax = plt.subplots(figsize=(10,10))
ax.axis('equal')
#Inner ring (vaccine combos)
wedges1, _ = ax.pie(inner_sizes, radius=1.0, labels=None,
                    colors=base_colors[:len(inner_labels)],
                    wedgeprops=dict(width=0.3, edgecolor='black'))
#Outer ring
wedges2, texts = ax.pie(
    outer_sizes,
    radius=1.3,
    labels=outer_label_text,  # use the prepared percentage labels
    labeldistance=1.01,
    colors=outer_colors,
    wedgeprops=dict(width=0.3, edgecolor='black'),
    textprops=dict(fontsize=9)
)

# Add the combo labels. Putting them inside the ring sections started to look cluttered.
# I asked ChatGPT what it would suggest, and it said to position them dynamically in the hole of the donut. Shrinking the inside makes the outside messy.
for w, label in zip(wedges1, inner_labels):
    ang = (w.theta2 + w.theta1) / 2  # midpoint angle
    x = np.cos(np.deg2rad(ang)) * 0.65  # radius for label
    y = np.sin(np.deg2rad(ang)) * 0.65
    ax.text(x, y, label, ha='center', va='center', fontsize=10, weight='bold')

plt.title("Vaccine Combination (A=Anthrax, I=IBR, B=BVD, R=Rabies)\nand Health Status (%)", fontsize=13,y=-0.15)
plt.show()

Analysis

This graph is large, as there are 16 possible vaccine states a cow may find itself in before purchase. But for our needs, which is healthy cows, we can see that the small population of cows without vaccines is unhealthy at a slightly higher rate than all other categories. Over many years or many farms, these slight changes in the general condition of the cows will greatly impact the production of the heffers.
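Since the differences between combos are slight, it may be easier to read them from a table than from the donut. A minimal sketch, assuming the `Vaccine_Combo` and `Health_Group` columns built above and using made-up rows in place of `df2`:

```python
import pandas as pd

# Made-up rows for illustration; column names follow df2 after the cell above has run.
demo = pd.DataFrame({
    "Vaccine_Combo": ["None", "None", "AI", "AI", "AI", "AI"],
    "Health_Group": ["Healthy", "Not_Healthy", "Healthy", "Healthy", "Healthy", "Not_Healthy"],
})

# Row-normalised crosstab: each row sums to 1, so values are shares within a combo.
shares = pd.crosstab(demo["Vaccine_Combo"], demo["Health_Group"], normalize="index")

# Percentage of Not_Healthy cows per combo, highest first.
not_healthy_pct = (shares["Not_Healthy"] * 100).sort_values(ascending=False)
print(not_healthy_pct.round(1))
```

On the real data, the top of this sorted list would show whether the unvaccinated group really carries the highest unhealthy share.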

Conclusion

What does all this data say? Well, for one, the most productive dairy cow is NOT the most popular cow anywhere in the world. This could be because cows are being used for more than just milk, or it could be due to the expense of the highest producer. For Happy Heffers, the call depends on where the farm needs to be, but we have 4 other breeds to choose from that produce 15-17 L, nearly double the 8-9 L average. Choosing the right cow is critical! Next, we can see that healthy cows can handle an increased milking schedule, producing large quantities of milk for sale, so having healthy cows is critical to maximizing milk production. Vaccinated cows were healthy at a slightly higher rate, so vaccination improves the odds of healthy cows and, with them, yield.

Happy Heffers can choose from a few top-producing breeds, most likely by availability in the region. They should choose cows between 3 and 6 years old to get at least 2 years of high average milk production. And they should choose and maintain healthy cows, at minimum through vaccination, but also through a milking schedule that results in optimum milk production.