Task 1: Data Mapping and Understanding

#Load Libraries
rm(list=ls())
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
setwd("C:\\AGR333")

Read Excel sheets

excel_file<-read_xlsx("data_animal_science.xlsx")
df_milk <- read_xlsx("data_animal_science.xlsx", sheet = "Milk Consumption")
df_weight <- read_xlsx("data_animal_science.xlsx", sheet = "Body Weights")
df_serum <- read_xlsx("data_animal_science.xlsx", sheet = "Serum Total Protein")
## Warning: Expecting numeric in D2314 / R2314C4: got '.'
## Warning: Expecting numeric in D2345 / R2345C4: got '.'
## Warning: Expecting numeric in D4525 / R4525C4: got '.'
## Warning: Expecting numeric in D4527 / R4527C4: got '.'
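
The warnings above show that some cells in the serum sheet hold a "." placeholder instead of a number. One way to handle this, sketched below, is to tell `read_xlsx()` to treat "." as missing through its `na` argument. The toy vector `raw` and the names `parsed` and `path` are illustrative assumptions; the file and sheet names are the ones used above.

```r
library(readxl)

# What a "." placeholder means for a numeric column (toy values, not from the workbook)
raw <- c("5.6", ".", "6.1")
parsed <- as.numeric(replace(raw, raw == ".", NA))
parsed

# Same idea at read time; guarded so the sketch still runs without the file
path <- "data_animal_science.xlsx"
if (file.exists(path)) {
  df_serum <- read_xlsx(path, sheet = "Serum Total Protein", na = ".")
  # Count missing values per column to see where the placeholders were
  print(colSums(is.na(df_serum)))
}
```

Reading the placeholder as `NA` should silence the warnings and makes the missing cells easy to count before plotting.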

Task 2: Outlier Evaluation Graphs

Graph for Milk Consumption

ggplot(df_milk,aes(x=Milk_Consumption_Liters))+
  geom_histogram(bins=30,fill="purple",alpha=0.7)+
  labs(title="Milk Consumption in Liters per Calf",
       x="Milk Consumption in Liters",
       y="Count")

Figure 1. Histogram of milk consumption (liters) per calf.

ggplot(df_milk, aes(x = 1, y = Milk_Consumption_Liters)) +
  geom_jitter(width = 0.1, alpha = 0.4) +
  labs(title = "Milk Consumption",
       y = "Milk Consumption in Liters",
       x = "") +
  theme_minimal()

Figure 2. Jitter plot of Calf Milk Consumption in Liters

Graph for Body Weights

ggplot(df_weight, aes(y = Birth_Weight)) +
  geom_boxplot(fill = "orange", alpha = 0.7) +
  labs(title = "Boxplot of Birth Weight",
       y = "Birth Weight") +
  theme_minimal()

ggplot(df_weight, aes(x = 1, y = Birth_Weight)) +
  geom_jitter(width = 0.1, alpha = 0.4) +
  labs(title = "Individual Birth Weights",
       y = "Birth Weight",
       x = "") +
  theme_minimal()

ggplot(df_weight, aes(x = Birth_Weight)) +
  geom_histogram(binwidth = 10, fill = "orange", color = "black") +
  labs(title = "Histogram of Birth Weight",
       x = "Birth Weight",
       y = "Count") +
  theme_minimal()

Figures 3, 4, and 5. Boxplot, jitter plot, and histogram of individual calf birth weights.

Graph for Serum Total Proteins

df_serum_clean <- df_serum %>%     # Remove the extreme outlier of 28980.00; filter() also drops rows where the value is NA
  filter(Serum_Total_Protein < 50)

ggplot(df_serum_clean, aes(y = Serum_Total_Protein)) +
  geom_boxplot(fill = "lightgreen", alpha = 0.7) +
  labs(title = "Serum Total Protein",
       y = "Serum Total Protein") +
  theme_minimal()

Figure 6. Boxplot of Serum Total Protein

Task 3: Outlier Analysis

1. Explain your choice of graph type for exploring outliers:

I used boxplots, histograms, and jitter plots because each graph shows outliers in a different way. Boxplots clearly highlight extreme values, histograms show the overall distribution and tail ends, and jitter plots display individual data points and where values are most concentrated. Using multiple graphs allowed me to better understand the data and confirm potential outliers.

2. Do you consider any of the data points outliers?

Yes, several data points appear to be outliers. In the milk consumption data, values below 250 liters and above 750 liters were considered outliers. In the birth weight data, values below 50 lb and above 150 lb were identified as outliers. In the serum total protein data, one extreme value near 28,980 was clearly an outlier and biologically unrealistic. Some other low and high protein values were considered minor outliers.

3. How did you come to that conclusion?

These conclusions were based on visual patterns in the graphs and statistical summaries. In the plots, most values clustered within a narrow range, while a few points were far outside this range. The histograms showed long tails, boxplots highlighted extreme values, and the summary statistics revealed unusually large maximum values. These observations indicated that certain points did not fit the overall distribution of the data.
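
The visual check can be backed up with a numeric rule. The sketch below flags values that fall more than 1.5 times the interquartile range beyond the quartiles, which is similar to the rule boxplots use when drawing individual points. The helper `flag_outliers()` and the toy vector are hypothetical and are not taken from the data set.

```r
# Hypothetical helper: flag values outside the k * IQR fences
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  # Missing values stay NA in the result
  x < q[[1]] - k * iqr | x > q[[2]] + k * iqr
}

# Toy values, not taken from the workbook
toy <- c(5.2, 5.8, 6.0, 6.1, 6.3, 6.5, 28980)
flags <- flag_outliers(toy)

# Show only the flagged values
toy[flags]
```

A rule like this is only a screening step; whether a flagged value is an error still depends on the biology and the data source.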

4. What would you do next if there were outliers?

If outliers are present, the next step is to investigate their cause by checking for data entry errors, incorrect units, or measurement mistakes. For example, in the serum total protein data, an extreme value of 28,980 was identified using a boxplot and summary statistics. Because this value was biologically unrealistic, it was removed by filtering the data to include only values below 50. After cleaning the data, the plots were recreated to confirm that the distribution was reasonable. If outliers are biologically meaningful, they should be kept and discussed in the analysis. After addressing outliers, the data should be reanalyzed to ensure accurate results.
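
As a possible alternative to deleting rows straight away, implausible values can be flagged first so they stay available for review. The sketch below uses a toy data frame (not the real sheet) and the same cutoff of 50. It also shows that a plain `filter()` comparison drops rows with missing values. The names `toy_serum`, `implausible`, `kept`, and `strict` are assumptions for illustration.

```r
library(dplyr)

# Toy data, not taken from the workbook
toy_serum <- data.frame(
  Calf_ID = 1:5,
  Serum_Total_Protein = c(5.6, 6.1, NA, 28980, 5.9)
)

# Flag implausible values instead of deleting them right away
flagged <- toy_serum %>%
  mutate(implausible = !is.na(Serum_Total_Protein) & Serum_Total_Protein >= 50)

# Keeps the row with the missing value, drops only the flagged one
kept <- flagged %>% filter(!implausible)

# A plain comparison drops the flagged row and the missing one
strict <- toy_serum %>% filter(Serum_Total_Protein < 50)

nrow(kept)
nrow(strict)
```

Which version is appropriate depends on whether calves with a missing protein value should remain in later steps.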

Task 4: Data Filtering using dplyr

library(dplyr)
# Filter milk consumption data
filtered_milk <- df_milk %>%
  filter(Milk_Consumption_Liters > 50)

# Filter body weights data
filtered_weight <- df_weight %>%
  filter(Birth_Weight >= 50, Birth_Weight <= 150)

# Filter serum total protein data
filtered_serum <- df_serum_clean %>%
  filter(Days_of_age >= 1, Days_of_age <= 10)

# Task 5: combine the filtered data sets by Calf_ID
combined_data <- filtered_milk %>%
  inner_join(filtered_weight, by = "Calf_ID") %>%
  inner_join(filtered_serum, by = "Calf_ID")
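
Because `inner_join()` keeps only calves present in every table, it can be worth checking which calves drop out and whether any table has more than one row per calf. The sketch below uses toy data frames; `milk`, `weight`, `joined`, `unmatched`, and `repeats` are hypothetical names, and only the `Calf_ID` key and column names come from the code above.

```r
library(dplyr)

# Toy tables, not taken from the workbook
milk <- data.frame(Calf_ID = c(1, 2, 3),
                   Milk_Consumption_Liters = c(500, 620, 410))
weight <- data.frame(Calf_ID = c(1, 2, 4),
                     Birth_Weight = c(85, 92, 78))

# Calves present in both tables
joined <- inner_join(milk, weight, by = "Calf_ID")

# Calves in milk with no match in weight (these are lost by the inner join)
unmatched <- anti_join(milk, weight, by = "Calf_ID")

# Calves with more than one row, which would multiply rows in a join
repeats <- weight %>% count(Calf_ID) %>% filter(n > 1)

nrow(joined)
unmatched$Calf_ID
nrow(repeats)
```

If `repeats` is not empty for a real table, summarising to one row per calf before joining may be needed.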

Task 5: Data Combination and Relationship Analysis

#Relationship between Birthweight and Milk Consumption
ggplot(combined_data, aes(x = Birth_Weight, y = Milk_Consumption_Liters)) +
  geom_point()+
  labs(title="Birth Weight vs Milk Consumption",
       x="Birth Weight",
       y="Milk Consumption in Liters")

Figure 7. Scatterplot showing relationship between Birth Weight and Milk Consumption

Milk consumption was mostly concentrated between ~400–650 L across birth weights of ~70–110 lb. The scatterplot suggests little to no linear relationship between birth weight and milk consumption, although a slight positive trend may be present. Several low and high milk-consumption outliers were observed.
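
A correlation coefficient is one way to put a number on the visual impression of a weak relationship. The sketch below uses made-up values in a toy data frame; the same calls could be applied to `combined_data`, assuming it has the columns used in the plot.

```r
# Toy values, not taken from the workbook
toy <- data.frame(
  Birth_Weight = c(70, 80, 90, 100, 110),
  Milk_Consumption_Liters = c(480, 520, 450, 610, 540)
)

# Pearson correlation; use = "complete.obs" ignores rows with missing values
r <- cor(toy$Birth_Weight, toy$Milk_Consumption_Liters, use = "complete.obs")

# Spearman rank correlation is less sensitive to extreme values
rho <- cor(toy$Birth_Weight, toy$Milk_Consumption_Liters,
           use = "complete.obs", method = "spearman")

r
rho
```

Values near 0 would agree with the scatterplot reading of little to no linear relationship, while values near 1 or -1 would not.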

#Relationship of Serum Total Protein and Milk Consumption
ggplot(combined_data, aes(x = Serum_Total_Protein, y = Milk_Consumption_Liters)) +
  geom_point()+
  labs(title="Serum Total Protein vs Milk Consumption",
       x="STP",
       y="Milk Consumption in Liters")

Figure 8. Scatterplot showing relationship between Milk Consumption and Serum Total Protein

The scatterplot shows no strong relationship between serum total protein and milk consumption. Most calves consumed between approximately 400 and 650 liters regardless of protein level. Although several low and high outliers were observed, no clear trend was present.