Assignment 1

Author

Zilan Oz

Modified

November 1, 2024

Part I. Checking the data

Problem 1: In the subset file “sd02”, two columns encode the sex of the animals:

One named as “Sex” and given “M” and “F” values.
The other named as “sex2” and given values 0 and 1.

We can use the Sex column (M for Male, F for Female) as the reference and update or create a new column to replace sex2 (0 and 1) with consistent, meaningful labels.

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("readr")



library(tidyverse)
library(datasets)
library(readxl)
library(readr)
library(dplyr)

all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")


subset_34 <- subset_34 %>%
  mutate(
    Sex = case_when(
      Sex == "M" ~ "Male",    # Replace "M" with "Male"
      Sex == "F" ~ "Female",  # Replace "F" with "Female"
      TRUE ~ NA_character_    # Handle any unexpected values
    ),
    sex2 = case_when(
      Sex == "Male" ~ 1,      # Male is encoded as 1
      Sex == "Female" ~ 0,    # Female is encoded as 0
      TRUE ~ NA_real_         # Handle any unexpected values
    )
  )

# View the updated dataset
head(subset_34)

# A tibble: 6 × 53
  AnimalID2 Sex    Species   Age Reason Fasting   WBC   RBC   HGB   HCT   MCV
      <dbl> <chr>  <chr>   <dbl> <chr>  <lgl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1         1 Female TT         11 R      TRUE      7.5  3.18  13.9  40.7  128.
2         1 Female TT         12 R      TRUE      7.4  3.68  16.2  47.9  130.
3         1 Female TT         13 R      TRUE      6.5  3.40  14.7  44.5  131.
4         1 Female TT         14 R      TRUE      9.6  3.56  15.8  45.7  128.
5         1 Female TT         15 R      TRUE      8.3  3.51  16    46.7  133.
6         1 Female TT         16 R      TRUE      8.8  3.4   14.9  43.6  128.
# ℹ 42 more variables: MCH <dbl>, MCHC <dbl>, RBCDist <dbl>, NRBC <dbl>,
#   Platelets <dbl>, MPV <dbl>, SEGS <dbl>, ACNeutrophils <dbl>, Lymphs <dbl>,
#   ACLymphocytes <dbl>, Monocytes <dbl>, ACMonocytes <dbl>, EOS <dbl>,
#   ACEosinophils <dbl>, Glucose <dbl>, BUN <dbl>, Creatinine <dbl>,
#   UricAcid <dbl>, Sodium <dbl>, Potassium <dbl>, Chloride <dbl>, CO2 <dbl>,
#   Protein <dbl>, Albumin <dbl>, Calcium <dbl>, InorgPhos <dbl>,
#   AlkPhos <dbl>, LDH <dbl>, AST <dbl>, ALT <dbl>, GGT <dbl>, …
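Before overwriting sex2, it can be worth checking whether the two columns actually agree. The following is a minimal sketch in base R on a small made-up data frame (`toy` and its values are invented for illustration and are not taken from sd02.xlsx; the coding 1 = male, 0 = female is the same assumption made in the code above):

```r
# Toy data standing in for the Sex and sex2 columns (values are invented)
toy <- data.frame(
  Sex  = c("M", "F", "F", "M", "F"),
  sex2 = c(1, 0, 0, 1, 1)   # the last row deliberately disagrees
)

# Cross-tabulate the two encodings; unexpected cells flag disagreements
xtab <- table(Sex = toy$Sex, sex2 = toy$sex2)
print(xtab)

# Row numbers where sex2 does not match the assumed coding (1 = M, 0 = F)
expected <- ifelse(toy$Sex == "M", 1, 0)
mismatch <- which(toy$sex2 != expected)
print(mismatch)
```

If the cross-table shows no disagreements, either column can safely serve as the reference.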

Problem 2: In both of the datasets, the first 4 rows are used to convey information. These are problematic as they should not be confused with actual data rows during the data analysis processes.

To avoid mistakes, we can either delete these rows from the Excel file before starting the analysis or skip them when reading the data in the code.

all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")

# Skip the first 4 rows while reading the data
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx", skip = 4)
all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx", skip = 4)
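To illustrate what the `skip` argument does without needing the Excel files, here is a small sketch using base R's `read.csv()`, which also accepts `skip`. The inline text is invented and only mimics a file whose first four lines are notes:

```r
# Inline text standing in for a file with 4 leading note lines (invented content)
raw_text <- "note line 1
note line 2
note line 3
note line 4
AnimalID,Age,HGB
1,26.5,13.6
1,26.26,12.7"

# skip = 4 drops the note lines, so the fifth line is read as the header
demo <- read.csv(text = raw_text, skip = 4)
str(demo)
```

After reading, checking `names()` and the first rows with `head()` confirms that the header row was picked up correctly.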

Problem 3: In our dataset, we have two types of variables for each cell type:

Percent values: e.g., SEGS, Lymphs, Monocytes, EOS

Absolute counts: e.g., ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils

These variables are mathematically related, meaning that one set (either the percent values or the absolute counts) is redundant since I can derive one from the other using the total white blood cell count (WBC).

Why is this a problem?

If I have both percent values and absolute counts, there’s a risk that one set of variables might not align with the other, especially if the total WBC count changes or if updates are made inconsistently. For example, if the total WBC count (WBC) is recalculated or adjusted, but the percentages or absolute counts are not updated accordingly, it could result in mismatched or incorrect values. This would undermine the reliability of my analysis.
Although it would be fine with smaller datasets, redundant variables within already large datasets consume unnecessary storage space and may increase processing time.
Having two different ways to represent the same information adds unnecessary complexity to my dataset. During analysis, I would need to decide which variable to use and confirm that it is consistent with the other. For example, if I am comparing cell populations across samples, should I use the absolute counts or the percentages?

To resolve this redundancy:

We can delete one of the redundant variables within the pairs.
We may choose to retain the absolute counts (ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils) because they are more critical for the type of analysis we may intend to perform.
We can recalculate the removed variables when required. The percent values (SEGS, Lymphs, Monocytes, EOS) can always be calculated when needed, using the following logic:

all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")

# Example: recompute a percent value from its absolute count and WBC
# all_dolphins_144$SEGS <- (all_dolphins_144$ACNeutrophils / all_dolphins_144$WBC) * 100
#SEGS <- (ACNeutrophils / WBC) * 100
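The relationship can be checked in both directions on a few values. The sketch below uses the first three samples as they are displayed in the tables of this report; the vector names (`wbc`, `segs`, `ac_n`) are chosen here for illustration only:

```r
# First three samples as displayed in the tables of this report
wbc  <- c(8.4, 10.7, 7.9)        # WBC
segs <- c(54, 61, 57)            # SEGS (percent)
ac_n <- c(4.536, 6.527, 4.503)   # ACNeutrophils (absolute count)

# Percent from absolute count, and absolute count from percent
segs_from_ac <- ac_n / wbc * 100
ac_from_segs <- segs / 100 * wbc

# Both directions should reproduce the recorded values (up to floating-point tolerance)
print(all.equal(segs_from_ac, segs))
print(all.equal(ac_from_segs, ac_n))
```

The same check could be run on the full columns to find rows where the two representations disagree.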

Problem 4: Upon examining the dataset (sd01.xlsx), I identified an integrity issue in the Species column, which holds the subspecies codes. According to the paper, two subspecies of bottlenose dolphins were investigated: Tursiops truncatus truncatus (denoted as TT) and Tursiops truncatus gilli (denoted as TG and TTG). The column therefore contains more codes than there are subspecies, which creates ambiguity regarding the classification.

The same subspecies may be represented by different codes (TG and TTTG), which violates the principle of consistent data representation, introduces uncertainty, and makes it difficult to interpret the dataset accurately or use it for analysis.

To resolve this, we can standardise the subspecies codes so that each subspecies has a single code. For example:

Use TT for Tursiops truncatus truncatus.
Use TTG (or another single code) for Tursiops truncatus gilli.

As a first step, I propose standardizing the Species column by merging TG into TTG (or vice versa). My reasoning for this modification is based on:

Evidence from the paper: The paper explicitly states that only two subspecies were studied.

# Recode the level TG as TTG
all_dolphins_144 <- all_dolphins_144 %>%
  mutate(Species = ifelse(Species == "TG", "TTG", Species))

# to verify the presence of the TT and TTG in the data
unique(all_dolphins_144$Species)

[1] "TT"   "TTTG" "TTG"

# to calculate counts and percentages for each species
species_counts <- all_dolphins_144 %>%
  count(Species) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# print the result
print(species_counts)

# A tibble: 3 × 3
  Species     n Percentage
  <chr>   <int>      <dbl>
1 TT       5544      94.1 
2 TTG        90       1.53
3 TTTG      255       4.33

#checking for NA values
na_species <- sum(is.na(all_dolphins_144$Species))
cat("Number of missing values in 'Species':", na_species, "\n")

Number of missing values in 'Species': 0

However, the percentages do not match those reported in the paper: TT accounts for 94.14% and TTTG (the code whose meaning we were unsure of) for 4.33%. Together these add up to 98.47%, which matches the 98.5% reported for TT in the article. Our assumption that TTTG belongs with TTG was therefore wrong; TTTG belongs to the TT subspecies.
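The arithmetic behind this comparison can be reproduced directly from the counts printed above. This is only a worked check; the vector name `counts` is chosen here for illustration:

```r
# Counts per code, copied from the species_counts output above
counts <- c(TT = 5544, TTG = 90, TTTG = 255)

# Percentage of all samples per code
pct <- 100 * counts / sum(counts)
print(round(pct, 2))

# Combined share of TT and TTTG, to compare with the 98.5% reported for TT
combined <- pct[["TT"]] + pct[["TTTG"]]
print(round(combined, 2))
```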

To fix this, we recode the “TTTG” level as “TT”.

all_dolphins_144 <- all_dolphins_144 %>%
  mutate(Species = ifelse(Species == "TTTG", "TT", Species))

# Recalculate the percentage distribution
species_counts <- all_dolphins_144 %>%
  count(Species) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# Display the corrected percentages
print(species_counts)

# A tibble: 2 × 3
  Species     n Percentage
  <chr>   <int>      <dbl>
1 TT       5799      98.5 
2 TTG        90       1.53

Question: Display the top 6 lines of each of your new data frames. You don’t have to number these tables or to add table captions.

library(knitr)  # For kable
library(DT)     # For interactive datatables

# creating a dataset for absolute counts
all_dolphins_144_absolute <- all_dolphins_144 %>%
  select(-c(SEGS, Lymphs, Monocytes, EOS))

# creating a dataset for percentages
all_dolphins_144_percentage <- all_dolphins_144 %>%
  select(-c(ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils))

# displaying the first 6 rows of the absolute counts dataset as a styled table
cat("###  Table 1: Absolute Counts Table\n")

###  Table 1: Absolute Counts Table

kable(head(all_dolphins_144_absolute), caption = "First 6 Rows of Absolute Counts Dataset")

First 6 Rows of Absolute Counts Dataset
AnimalID	Sex	Species	Age	Reason	Fasting	LabCode	WBC	RBC	HGB	HCT	MCV	MCH	MCHC	RBCDist	NRBC	Platelets	MPV	ACNeutrophils	ACLymphocytes	ACMonocytes	ACEosinophils	Glucose	BUN	Creatinine	Sodium	Potassium	Chloride	CO2	Protein	Albumin	Calcium	InorgPhos	AlkPhos	LDH	AST	ALT	GGT	Bilirubin	Cholesterol	Triglyceride	Iron	CPK	SED60	Mg	GFR
1	F	TT	26.50	R	TRUE	6	8.4	3.20	13.6	46.6	145.6	42.5	29.2	14.1	0	69	NA	4.536	1.344	0.588	1.932	95	35	1.3	156	3.9	120	27	7.0	5.2	8.9	6.4	228	464	145	23	28	0.0	182	68	153	162	NA	1.8	221
1	F	TT	26.26	R	TRUE	6	10.7	2.98	12.7	42.8	143.6	42.6	29.7	14.4	0	80	NA	6.527	2.247	0.642	1.284	77	43	1.2	157	4.2	119	25	6.8	5.1	9.0	6.1	204	530	128	18	25	0.0	176	48	109	142	4	1.7	243
1	F	TT	26.02	R	TRUE	6	7.9	3.10	13.7	46.3	149.4	44.2	29.6	14.7	1	89	NA	4.503	1.501	0.474	1.422	95	28	1.3	154	3.7	118	26	6.7	5.2	8.9	5.3	214	445	162	22	23	0.1	214	46	167	118	4	2.1	221
1	F	TT	25.77	R	TRUE	6	6.5	3.17	13.6	46.2	145.7	42.9	29.4	13.3	0	59	NA	4.225	1.690	0.260	0.325	98	42	1.1	153	4.3	119	24	6.1	4.9	8.6	5.8	184	420	154	22	24	0.0	204	73	115	113	1	1.9	268
1	F	TT	25.56	R	TRUE	6	6.6	3.06	13.0	44.9	146.7	42.5	29.0	15.6	0	68	NA	3.696	1.056	0.396	1.452	83	34	1.3	153	3.6	118	26	6.4	5.2	8.7	5.6	204	348	176	24	24	0.2	194	69	203	109	4	2.2	221
1	F	TT	25.52	R	TRUE	6	7.2	2.86	12.7	41.8	146.2	44.4	30.4	15.2	0	56	NA	3.384	1.152	0.576	2.088	98	44	1.1	155	4.0	124	24	5.9	5.1	8.4	5.9	206	468	169	25	23	0.0	201	84	158	156	2	1.8	268

# displaying the first 6 rows of the percentages dataset as a styled table
cat("\n### Table 2: Percentages Table\n")


### Table 2: Percentages Table

kable(head(all_dolphins_144_percentage), caption = "First 6 Rows of Percentages Dataset")

First 6 Rows of Percentages Dataset
AnimalID	Sex	Species	Age	Reason	Fasting	LabCode	WBC	RBC	HGB	HCT	MCV	MCH	MCHC	RBCDist	NRBC	Platelets	MPV	SEGS	Lymphs	Monocytes	EOS	Glucose	BUN	Creatinine	Sodium	Potassium	Chloride	CO2	Protein	Albumin	Calcium	InorgPhos	AlkPhos	LDH	AST	ALT	GGT	Bilirubin	Cholesterol	Triglyceride	Iron	CPK	SED60	Mg	GFR
1	F	TT	26.50	R	TRUE	6	8.4	3.20	13.6	46.6	145.6	42.5	29.2	14.1	0	69	NA	54	16	7	23	95	35	1.3	156	3.9	120	27	7.0	5.2	8.9	6.4	228	464	145	23	28	0.0	182	68	153	162	NA	1.8	221
1	F	TT	26.26	R	TRUE	6	10.7	2.98	12.7	42.8	143.6	42.6	29.7	14.4	0	80	NA	61	21	6	12	77	43	1.2	157	4.2	119	25	6.8	5.1	9.0	6.1	204	530	128	18	25	0.0	176	48	109	142	4	1.7	243
1	F	TT	26.02	R	TRUE	6	7.9	3.10	13.7	46.3	149.4	44.2	29.6	14.7	1	89	NA	57	19	6	18	95	28	1.3	154	3.7	118	26	6.7	5.2	8.9	5.3	214	445	162	22	23	0.1	214	46	167	118	4	2.1	221
1	F	TT	25.77	R	TRUE	6	6.5	3.17	13.6	46.2	145.7	42.9	29.4	13.3	0	59	NA	65	26	4	5	98	42	1.1	153	4.3	119	24	6.1	4.9	8.6	5.8	184	420	154	22	24	0.0	204	73	115	113	1	1.9	268
1	F	TT	25.56	R	TRUE	6	6.6	3.06	13.0	44.9	146.7	42.5	29.0	15.6	0	68	NA	56	16	6	22	83	34	1.3	153	3.6	118	26	6.4	5.2	8.7	5.6	204	348	176	24	24	0.2	194	69	203	109	4	2.2	221
1	F	TT	25.52	R	TRUE	6	7.2	2.86	12.7	41.8	146.2	44.4	30.4	15.2	0	56	NA	47	16	8	29	98	44	1.1	155	4.0	124	24	5.9	5.1	8.4	5.9	206	468	169	25	23	0.0	201	84	158	156	2	1.8	268

Part II. Checking Mid age population, results and conclusions

Question: “Calculate the minimal age, maximal age and number of years (period) over which each of the Mid age animals (the ones from table sd02.xlsx) were followed. Also give a summary of these characteristics using the summary() function. You don’t have to make a proper table of the output of the summary() function.”

# calculate minimal, maximal age, and period for each AnimalID2
age_summary <- subset_34 %>%
  group_by(AnimalID2) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Period (years followed)
  )

# the calculated summary for each animal
print(age_summary)

# A tibble: 34 × 4
   AnimalID2 Min_Age Max_Age Period
       <dbl>   <dbl>   <dbl>  <dbl>
 1         1      11      26     15
 2         2      15      26     11
 3         3      15      36     21
 4         6      16      37     21
 5         7      13      34     21
 6         9      14      27     13
 7        11      17      38     21
 8        12      14      30     16
 9        13      14      31     17
10        14      17      38     21
# ℹ 24 more rows

summary(age_summary)

   AnimalID2        Min_Age      Max_Age          Period     
 Min.   : 1.00   Min.   :10   Min.   :24.00   Min.   :11.00  
 1st Qu.:13.25   1st Qu.:12   1st Qu.:31.00   1st Qu.:16.00  
 Median :25.00   Median :14   Median :33.00   Median :21.00  
 Mean   :26.09   Mean   :14   Mean   :32.44   Mean   :18.44  
 3rd Qu.:39.75   3rd Qu.:16   3rd Qu.:34.75   3rd Qu.:21.00  
 Max.   :53.00   Max.   :21   Max.   :40.00   Max.   :21.00

The AnimalID2 values range from 1 to 53, indicating the unique identifiers for the animals studied.

Min_Age

Minimum Age: The youngest starting age in the dataset is 10 years.
Maximum Age: The oldest starting age is 21 years.

Max_Age

Range of Max Age: The youngest ending age is 24 years, while the oldest is 40 years.
Mean Max Age: The average maximum age is 32.44 years.

Period

Range of Period: The shortest follow-up period is 11 years, while the longest is 21 years.
Mean Period: On average, animals were followed for 18.44 years.
Median Period: The median follow-up period is 21 years, which equals the maximum, so at least half of the animals were followed for the full 21 years.

Question: “Formulate objective selection criteria for membership of the Mid age group based on the outcome of your previous analysis. Keep in mind that the goal is: [1] To make observations in early adulthood (adulthood starts between 5 and 12 years, earlier with females than males). [2] To follow these animals at least for a significant number of years into adulthood and “old age”. For wild bottlenose dolphins 25 years is old, but not when they live in captivity. [3] Not to deviate too much from the apparent criteria used by the authors. [4] For simplicity, use the same criteria for males and females.”

Answer: In the authors’ subset, the age at first measurement ranged from 10 to 21 years. This does not match the first goal, which is to start observations in early adulthood (5 to 12 years of age).

Our suggestion is to align the selection with early adulthood by setting 5 years as the minimum age for inclusion, and to keep 40 years as the maximum age for an animal’s data to be included in the sample.

Question: “Apply these criteria on all 144 dolphins to create your own Objective mid age selection and register your selection in a separate column. Again, display the first 6 lines of the modified data table. You don’t have to number this table or to add table captions”

#| label: tbl-taba
#| tbl-cap: "Filtered Dataset: Ages Between 5 and 40"

# Filter the dataset to include ages between 5 and 40
subset_q3 <- all_dolphins_144 %>%
  filter(Age >= 5 & Age <= 40)

# Display the first few rows of the filtered dataset as a table
library(knitr)
cat("### Filtered Dataset: Ages Between 5 and 40\n")

### Filtered Dataset: Ages Between 5 and 40

kable(head(subset_q3), caption = "First Few Rows of Filtered Dataset")

First Few Rows of Filtered Dataset
AnimalID	Sex	Species	Age	Reason	Fasting	LabCode	WBC	RBC	HGB	HCT	MCV	MCH	MCHC	RBCDist	NRBC	Platelets	MPV	SEGS	ACNeutrophils	Lymphs	ACLymphocytes	Monocytes	ACMonocytes	EOS	ACEosinophils	Glucose	BUN	Creatinine	Sodium	Potassium	Chloride	CO2	Protein	Albumin	Calcium	InorgPhos	AlkPhos	LDH	AST	ALT	GGT	Bilirubin	Cholesterol	Triglyceride	Iron	CPK	SED60	Mg	GFR
1	F	TT	26.50	R	TRUE	6	8.4	3.20	13.6	46.6	145.6	42.5	29.2	14.1	0	69	NA	54	4.536	16	1.344	7	0.588	23	1.932	95	35	1.3	156	3.9	120	27	7.0	5.2	8.9	6.4	228	464	145	23	28	0.0	182	68	153	162	NA	1.8	221
1	F	TT	26.26	R	TRUE	6	10.7	2.98	12.7	42.8	143.6	42.6	29.7	14.4	0	80	NA	61	6.527	21	2.247	6	0.642	12	1.284	77	43	1.2	157	4.2	119	25	6.8	5.1	9.0	6.1	204	530	128	18	25	0.0	176	48	109	142	4	1.7	243
1	F	TT	26.02	R	TRUE	6	7.9	3.10	13.7	46.3	149.4	44.2	29.6	14.7	1	89	NA	57	4.503	19	1.501	6	0.474	18	1.422	95	28	1.3	154	3.7	118	26	6.7	5.2	8.9	5.3	214	445	162	22	23	0.1	214	46	167	118	4	2.1	221
1	F	TT	25.77	R	TRUE	6	6.5	3.17	13.6	46.2	145.7	42.9	29.4	13.3	0	59	NA	65	4.225	26	1.690	4	0.260	5	0.325	98	42	1.1	153	4.3	119	24	6.1	4.9	8.6	5.8	184	420	154	22	24	0.0	204	73	115	113	1	1.9	268
1	F	TT	25.56	R	TRUE	6	6.6	3.06	13.0	44.9	146.7	42.5	29.0	15.6	0	68	NA	56	3.696	16	1.056	6	0.396	22	1.452	83	34	1.3	153	3.6	118	26	6.4	5.2	8.7	5.6	204	348	176	24	24	0.2	194	69	203	109	4	2.2	221
1	F	TT	25.52	R	TRUE	6	7.2	2.86	12.7	41.8	146.2	44.4	30.4	15.2	0	56	NA	47	3.384	16	1.152	8	0.576	29	2.088	98	44	1.1	155	4.0	124	24	5.9	5.1	8.4	5.9	206	468	169	25	23	0.0	201	84	158	156	2	1.8	268

# Summary of the Age variable
cat("\n### Summary of Age Variable\n")


### Summary of Age Variable

summary(subset_q3$Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.02   14.66   22.11   22.30   30.32   39.90
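The question asks for the selection to be registered in a separate column rather than applied as a filter. One possible way to do that is sketched below on invented rows; the data frame `toy` and the column name `objective_midage` are hypothetical and not part of the original analysis:

```r
library(dplyr)

# Invented sample rows standing in for all_dolphins_144
toy <- tibble(
  AnimalID = c(1, 1, 2, 3),
  Age      = c(3.5, 5.02, 26.5, 41.2)
)

# Flag rows that meet the 5 to 40 year criterion instead of dropping the others
toy <- toy %>%
  mutate(objective_midage = Age >= 5 & Age <= 40)

print(toy)
```

Keeping all rows and adding a logical flag makes it possible to compare selected and non-selected samples later without re-reading the data.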

Question: “Summarize similarities and differences between the authors Mid age group and the Objective mid age group in a table”.

AgeGroup_Summary_Subset34 <- subset_34 %>%
  group_by(agegroup) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Age span covered by the group
  )

# Display the summary as a clean table
kable(
  AgeGroup_Summary_Subset34,
  caption = "Summary of Age Groups in Subset of 34 Dolphins",
  format = "html"
)

Summary of Age Groups in Subset of 34 Dolphins
agegroup	Min_Age	Max_Age	Period
2	10	13	3
3	14	32	18
4	33	37	4
5	38	40	2

# Display the calculated summary for each animal
#print(AgeGroup_Summary_Subset34)

Since there is no age group column in the sd01 dataset, we created a new agegroup column in the objective dataset (“subset_q3”) that assigns each row to a group from 1 to 5, mirroring the grouping used in the sd02 dataset.

subset_q3 <- subset_q3 %>%
  mutate(
    agegroup = case_when(
      Age >= 5 & Age < 10 ~ 1,
      Age >= 10 & Age < 14 ~ 2,
      Age >= 14 & Age < 33 ~ 3,
      Age >= 33 & Age < 38 ~ 4,
      Age >= 38 & Age <= 40 ~ 5,
      TRUE ~ NA_real_  # Assign NA for ages outside these ranges
    )
  )

# View the updated dataset
head(subset_q3)

# A tibble: 6 × 52
  AnimalID Sex   Species   Age Reason Fasting LabCode   WBC   RBC   HGB   HCT
     <dbl> <chr> <chr>   <dbl> <chr>  <lgl>     <dbl> <dbl> <dbl> <dbl> <dbl>
1        1 F     TT       26.5 R      TRUE          6   8.4  3.2   13.6  46.6
2        1 F     TT       26.3 R      TRUE          6  10.7  2.98  12.7  42.8
3        1 F     TT       26.0 R      TRUE          6   7.9  3.1   13.7  46.3
4        1 F     TT       25.8 R      TRUE          6   6.5  3.17  13.6  46.2
5        1 F     TT       25.6 R      TRUE          6   6.6  3.06  13    44.9
6        1 F     TT       25.5 R      TRUE          6   7.2  2.86  12.7  41.8
# ℹ 41 more variables: MCV <dbl>, MCH <dbl>, MCHC <dbl>, RBCDist <dbl>,
#   NRBC <dbl>, Platelets <dbl>, MPV <dbl>, SEGS <dbl>, ACNeutrophils <dbl>,
#   Lymphs <dbl>, ACLymphocytes <dbl>, Monocytes <dbl>, ACMonocytes <dbl>,
#   EOS <dbl>, ACEosinophils <dbl>, Glucose <dbl>, BUN <dbl>, Creatinine <dbl>,
#   UricAcid <dbl>, Sodium <dbl>, Potassium <dbl>, Chloride <dbl>, CO2 <dbl>,
#   Protein <dbl>, Albumin <dbl>, Calcium <dbl>, InorgPhos <dbl>,
#   AlkPhos <dbl>, LDH <dbl>, AST <dbl>, ALT <dbl>, GGT <dbl>, …
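The same grouping can also be expressed with base R's `cut()`, which assigns each age to exactly one interval and leaves no gaps between boundaries. This is an alternative sketch, not the method used above; the vector `ages` contains example values chosen for illustration:

```r
# Example ages, including values outside the 5 to 40 range
ages <- c(4.0, 5.02, 9.99, 10, 13.91, 14, 32.91, 33, 37.92, 38.02, 40, 41)

agegroup <- cut(
  ages,
  breaks = c(5, 10, 14, 33, 38, 40),
  labels = FALSE,          # return integer codes 1 to 5
  right = FALSE,           # intervals are [lower, upper)
  include.lowest = TRUE    # with right = FALSE this closes the last interval: [38, 40]
)

# Ages outside 5 to 40 receive NA
print(data.frame(ages, agegroup))
```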

AgeGroup_Summary_Subset_q3 <- subset_q3 %>%
  group_by(agegroup) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Age span covered by the group
  )
# Display the summary as a clean table
kable(
  AgeGroup_Summary_Subset_q3,
  caption = "Summary of Age Groups in Objective Dataset",
  format = "html"
)

Summary of Age Groups in Objective Dataset
agegroup	Min_Age	Max_Age	Period
1	5.02	9.89	4.87
2	10.00	13.91	3.91
3	14.00	32.91	18.91
4	33.00	37.92	4.92
5	38.02	39.90	1.88

# Display the calculated summary for each animal
#print(AgeGroup_Summary_Subset_q3)

However, we realised that the subset dataset contains no age group 1 (ages 5 to 9.99), which leads to NA values when the two datasets are compared.

# Join the two data frames using a full outer join
Comparison_table <- merge(x=AgeGroup_Summary_Subset_q3, y=AgeGroup_Summary_Subset34,
                          by="agegroup", all = TRUE)

print(Comparison_table)

  agegroup Min_Age.x Max_Age.x Period.x Min_Age.y Max_Age.y Period.y
1        1      5.02      9.89     4.87        NA        NA       NA
2        2     10.00     13.91     3.91        10        13        3
3        3     14.00     32.91    18.91        14        32       18
4        4     33.00     37.92     4.92        33        37        4
5        5     38.02     39.90     1.88        38        40        2

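Note: in this table the .x columns refer to the objective dataset (subset_q3) and the .y columns to the authors’ subset (subset_34); the column suffixes should be renamed for clarity.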
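One way to get self-explanatory column names is the `suffixes` argument of `merge()`. The sketch below rebuilds the two summaries from the values shown in the tables above (Period is omitted for brevity); the names `objective`, `authors` and the suffix strings are chosen here for illustration:

```r
# Summary values copied from the two age group tables above
objective <- data.frame(
  agegroup = 1:5,
  Min_Age  = c(5.02, 10.00, 14.00, 33.00, 38.02),
  Max_Age  = c(9.89, 13.91, 32.91, 37.92, 39.90)
)
authors <- data.frame(
  agegroup = 2:5,
  Min_Age  = c(10, 14, 33, 38),
  Max_Age  = c(13, 32, 37, 40)
)

# Full outer join with descriptive suffixes instead of the default .x and .y
comparison <- merge(
  objective, authors,
  by = "agegroup", all = TRUE,
  suffixes = c("_objective", "_authors")
)
print(comparison)
```

Age group 1 still shows NA in the authors’ columns, because that group does not occur in the authors’ subset.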

Question: “By making a suitable graph of the results of the authors Mid age group (the selection of animals from sd02.xlsx) confirm or decline the conclusion by the authors that females apparently age faster than most males when taking into account hemoglobin content only. Use all data for these dolphins, not just the subset of data listed in sd02.xlsx. Tip: use the ggplot2 functions geom_smooth() to draw straight (!) trend lines and facet_wrap() to make separate plots for males and females.”

#| label: fig-hemoglobin
#| fig-cap: "Hemoglobin Amount: Age by Sex"
library(ggplot2)

# Prepare the plot data: keep ages 10 to 40 and drop rows with missing HGB, Age, or Sex
plot_data <- all_dolphins_144 %>%
  filter(!is.na(HGB), !is.na(Age), !is.na(Sex), Age >= 10 & Age <= 40)

# Create the plot
ggplot(plot_data, aes(x = Age, y = HGB, color = Sex)) +
  geom_point(alpha = 0.6) +  # Scatter points
  geom_smooth(method = "lm", se = TRUE, color = "black") +  # Black trend lines
  facet_wrap(~Sex) +  # Separate plots for males and females
  labs(
    title = "Hemoglobin Content vs. Age by Sex",
    x = "Age (years)",
    y = "Hemoglobin Content (HGB)"
  ) +
  theme_minimal()

Our analysis shows that hemoglobin (HGB) content decreases with increasing age in females, with a steeper fitted trend line than in males. Based on hemoglobin alone, we can therefore confirm the authors’ conclusion. The trend lines in the hemoglobin figure (fig-hemoglobin) suggest that females experience a faster reduction in hemoglobin content as they age, which is consistent with the hypothesis that females age more quickly in terms of hemoglobin levels. In males the decline is less pronounced, indicating slower aging in this biomarker.
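The visual comparison of slopes could be complemented by a linear model with an Age by Sex interaction, in which the interaction coefficient estimates the difference between the male and female slopes. The sketch below runs on simulated data: the data frame `sim`, the slopes and the noise level are all invented for illustration and are not the dolphin measurements.

```r
set.seed(1)

# Simulated data: females lose HGB faster with age than males (invented slopes)
n <- 200
sim <- data.frame(
  Age = runif(n, min = 10, max = 40),
  Sex = rep(c("F", "M"), each = n / 2)
)
true_slope <- ifelse(sim$Sex == "F", -0.06, -0.02)
sim$HGB <- 15 + true_slope * sim$Age + rnorm(n, sd = 0.5)

# Age * Sex expands to Age + Sex + Age:Sex; "F" is the reference level
fit <- lm(HGB ~ Age * Sex, data = sim)

# "Age" is the female slope; "Age:SexM" is the male slope minus the female slope
print(coef(summary(fit)))
slope_difference <- coef(fit)[["Age:SexM"]]
```

A positive interaction coefficient with a small p-value would indicate that the male decline is shallower than the female decline; applied to the real data, this would give a formal counterpart to the impression from the plot.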