Assignment 1

Author

Zilan Oz

Modified

November 1, 2024

Part I. Checking the data

Problem 1: In the subset file “sd02”, there are two dedicated columns for conveying “Sex” information of the animals:

  • One named as “Sex” and given “M” and “F” values.

  • The other named as “sex2” and given values 0 and 1.

We can use the Sex column (M for Male, F for Female) as the reference and update or create a new column to replace sex2 (0 and 1) with consistent, meaningful labels.

options(repos = c(CRAN = "https://cran.rstudio.com/"))
install.packages("readr")

The downloaded binary packages are in
    /var/folders/xh/l8syzm3s185d0wjdjzy_lyd00000gn/T//RtmpEfSGUr/downloaded_packages
library(tidyverse)
library(datasets)
library(readxl)
library(readr)
library(readxl)
library(dplyr)

all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")


subset_34 <- subset_34 %>%
  mutate(
    Sex = case_when(
      Sex == "M" ~ "Male",    # Replace "M" with "Male"
      Sex == "F" ~ "Female",  # Replace "F" with "Female"
      TRUE ~ NA_character_    # Handle any unexpected values
    ),
    sex2 = case_when(
      Sex == "Male" ~ 1,      # Male is encoded as 1
      Sex == "Female" ~ 0,    # Female is encoded as 0
      TRUE ~ NA_real_         # Handle any unexpected values
    )
  )

# View the updated dataset
head(subset_34)
# A tibble: 6 × 53
  AnimalID2 Sex    Species   Age Reason Fasting   WBC   RBC   HGB   HCT   MCV
      <dbl> <chr>  <chr>   <dbl> <chr>  <lgl>   <dbl> <dbl> <dbl> <dbl> <dbl>
1         1 Female TT         11 R      TRUE      7.5  3.18  13.9  40.7  128.
2         1 Female TT         12 R      TRUE      7.4  3.68  16.2  47.9  130.
3         1 Female TT         13 R      TRUE      6.5  3.40  14.7  44.5  131.
4         1 Female TT         14 R      TRUE      9.6  3.56  15.8  45.7  128.
5         1 Female TT         15 R      TRUE      8.3  3.51  16    46.7  133.
6         1 Female TT         16 R      TRUE      8.8  3.4   14.9  43.6  128.
# ℹ 42 more variables: MCH <dbl>, MCHC <dbl>, RBCDist <dbl>, NRBC <dbl>,
#   Platelets <dbl>, MPV <dbl>, SEGS <dbl>, ACNeutrophils <dbl>, Lymphs <dbl>,
#   ACLymphocytes <dbl>, Monocytes <dbl>, ACMonocytes <dbl>, EOS <dbl>,
#   ACEosinophils <dbl>, Glucose <dbl>, BUN <dbl>, Creatinine <dbl>,
#   UricAcid <dbl>, Sodium <dbl>, Potassium <dbl>, Chloride <dbl>, CO2 <dbl>,
#   Protein <dbl>, Albumin <dbl>, Calcium <dbl>, InorgPhos <dbl>,
#   AlkPhos <dbl>, LDH <dbl>, AST <dbl>, ALT <dbl>, GGT <dbl>, …

Problem 2: In both of the datasets, the first 4 rows are used to convey information. These are problematic as they should not be confused with actual data rows during the data analysis processes.

  • To avoid any mistakes we can either delete these rows from the excel file before starting to the analysis, or skip these rows via the code
all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")

# Skip the first 4 rows while reading the data
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx", skip = 4)
all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx", skip = 4)

Problem 3: In our dataset, we have two types of variables for each cell type:

Percent values: e.g., SEGS, Lymphs, Monocytes, EOS

Absolute counts: e.g., ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils

These variables are mathematically related, meaning that one set (either the percent values or the absolute counts) is redundant since I can derive one from the other using the total white blood cell count (WBC).

Why is this a problem?

  1. If I have both percent values and absolute counts, there’s a risk that one set of variables might not align with the other, especially if the total WBC count changes or if updates are made inconsistently. For example, if the total WBC count (WBC) is recalculated or adjusted, but the percentages or absolute counts are not updated accordingly, it could result in mismatched or incorrect values. This would undermine the reliability of my analysis.
  2. Although it would be fine with smaller datasets, redundant variables within already large datasets consume unnecessary storage space and may increase processing time.
  3. Having two different ways to represent the same information adds unnecessary complexity to my dataset. During analysis, I would need to decide which variable to use and confirm that it is consistent with the other. For example, if I am comparing cell populations across samples, should I use the absolute counts or the percentages?

To resolve this redundancy:

  1. We can delete one of the redundant variables within the pairs.
    We may choose to retain the absolute counts (ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils) because they are more critical for the type of analysis we may intend to perform.

  2. We can recalculate the removed variables when required. I can always calculate percent values (SEGS, Lymphs, Monocytes, EOS) when needed, via following the logic:

all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")
subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")
metadata <- read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")

#| all_dolphins_144$SEGS <- (all_dolphins_144$ACNeutrophils / all_dolphins_144$WBC) * 100
#SEGS <- (ACNeutrophils / WBC) * 100

Upon examining the dataset (sd01.xlsx), I identified an integrity issue in the column representing subspecies codes. According to the paper, two subspecies of bottlenose dolphins were investigated were Tursiops truncatus truncatus (denoted as TT) and Tursiops truncatus gilli (denoted as TG and TTG). This discrepancy creates ambiguity regarding the classification of subspecies.

The same subspecies may be represented by different codes (TG and TTTG), which violates the principle of consistent data representation, introduces uncertainty, makes it difficult to interpret the dataset accurately or use it for analysis.

In order to resolve this, we can standardise the subspecies codes to avoid ambiguity. For example:

  • Use TT for Tursiops truncatus truncatus.

  • Use TTG (or another single code) for Tursiops truncatus gilli

To resolve this issue, I propose standardizing the Species column by merging TTG into TG (or vice versa). My reasoning for this modification is based on:

  • Evidence from the Paper: The paper explicitly states that only two subspecies were studied.
#to change the levels
all_dolphins_144 <- all_dolphins_144 %>%
  mutate(Species = ifelse(Species == "TG", "TTG", Species))

# to verify the presence of the TT and TTG in the data
unique(all_dolphins_144$Species)
[1] "TT"   "TTTG" "TTG" 
# to calculate counts and percentages for each species
species_counts <- all_dolphins_144 %>%
  count(Species) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# print the result
print(species_counts)
# A tibble: 3 × 3
  Species     n Percentage
  <chr>   <int>      <dbl>
1 TT       5544      94.1 
2 TTG        90       1.53
3 TTTG      255       4.33
#checking for NA values
na_species <- sum(is.na(all_dolphins_144$Species))
cat("Number of missing values in 'Species':", na_species, "\n")
Number of missing values in 'Species': 0 

However, it seems that the percentage doesn’t match with that of mentioned in the paper, as TT seems to be %94.14 and TTTG (the level that we are unsure of what it means) is %4.33. Considering that their total adds up to the %98.5 mentioned as the percentage of the TT on the article, our assumption of the TTG belonging to TTTG is wrong, because the data is misleading. Instead, TTTG belongs to TT Species.

In order to fix this, we would need to change the “TTTG” levels to “TT” levels.

all_dolphins_144 <- all_dolphins_144 %>%
  mutate(Species = ifelse(Species == "TTTG", "TT", Species))

# Recalculate the percentage distribution
species_counts <- all_dolphins_144 %>%
  count(Species) %>%
  mutate(Percentage = (n / sum(n)) * 100)

# Display the corrected percentages
print(species_counts)
# A tibble: 2 × 3
  Species     n Percentage
  <chr>   <int>      <dbl>
1 TT       5799      98.5 
2 TTG        90       1.53

Question: Display the top 6 lines of each of your new data frames.You don’t have to number these tables or to add table captions.

library(knitr)  # For kable
library(DT)     # For interactive datatables

# creating a dataset for absolute counts
all_dolphins_144_absolute <- all_dolphins_144 %>%
  select(-c(SEGS, Lymphs, Monocytes, EOS))

# creating a dataset for percentages
all_dolphins_144_percentage <- all_dolphins_144 %>%
  select(-c(ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils))

# displaying the first 6 rows of the absolute counts dataset as a styled table
cat("###  Table 1: Absolute Counts Table\n")
###  Table 1: Absolute Counts Table
kable(head(all_dolphins_144_absolute), caption = "First 6 Rows of Absolute Counts Dataset")
First 6 Rows of Absolute Counts Dataset
AnimalID Sex Species Age Reason Fasting LabCode WBC RBC HGB HCT MCV MCH MCHC RBCDist NRBC Platelets MPV ACNeutrophils ACLymphocytes ACMonocytes ACEosinophils Glucose BUN Creatinine UricAcid Sodium Potassium Chloride CO2 Protein Albumin Calcium InorgPhos AlkPhos LDH AST ALT GGT Bilirubin Cholesterol Triglyceride Iron CPK SED60 Mg GFR
1 F TT 26.50 R TRUE 6 8.4 3.20 13.6 46.6 145.6 42.5 29.2 14.1 0 69 NA 4.536 1.344 0.588 1.932 95 35 1.3 0 156 3.9 120 27 7.0 5.2 8.9 6.4 228 464 145 23 28 0.0 182 68 153 162 NA 1.8 221
1 F TT 26.26 R TRUE 6 10.7 2.98 12.7 42.8 143.6 42.6 29.7 14.4 0 80 NA 6.527 2.247 0.642 1.284 77 43 1.2 0 157 4.2 119 25 6.8 5.1 9.0 6.1 204 530 128 18 25 0.0 176 48 109 142 4 1.7 243
1 F TT 26.02 R TRUE 6 7.9 3.10 13.7 46.3 149.4 44.2 29.6 14.7 1 89 NA 4.503 1.501 0.474 1.422 95 28 1.3 0 154 3.7 118 26 6.7 5.2 8.9 5.3 214 445 162 22 23 0.1 214 46 167 118 4 2.1 221
1 F TT 25.77 R TRUE 6 6.5 3.17 13.6 46.2 145.7 42.9 29.4 13.3 0 59 NA 4.225 1.690 0.260 0.325 98 42 1.1 0 153 4.3 119 24 6.1 4.9 8.6 5.8 184 420 154 22 24 0.0 204 73 115 113 1 1.9 268
1 F TT 25.56 R TRUE 6 6.6 3.06 13.0 44.9 146.7 42.5 29.0 15.6 0 68 NA 3.696 1.056 0.396 1.452 83 34 1.3 0 153 3.6 118 26 6.4 5.2 8.7 5.6 204 348 176 24 24 0.2 194 69 203 109 4 2.2 221
1 F TT 25.52 R TRUE 6 7.2 2.86 12.7 41.8 146.2 44.4 30.4 15.2 0 56 NA 3.384 1.152 0.576 2.088 98 44 1.1 0 155 4.0 124 24 5.9 5.1 8.4 5.9 206 468 169 25 23 0.0 201 84 158 156 2 1.8 268
# displaying the first 6 rows of the percentages dataset as a styled table
cat("\n### Table 2: Percentages Table\n")

### Table 2: Percentages Table
kable(head(all_dolphins_144_percentage), caption = "First 6 Rows of Percentages Dataset")
First 6 Rows of Percentages Dataset
AnimalID Sex Species Age Reason Fasting LabCode WBC RBC HGB HCT MCV MCH MCHC RBCDist NRBC Platelets MPV SEGS Lymphs Monocytes EOS Glucose BUN Creatinine UricAcid Sodium Potassium Chloride CO2 Protein Albumin Calcium InorgPhos AlkPhos LDH AST ALT GGT Bilirubin Cholesterol Triglyceride Iron CPK SED60 Mg GFR
1 F TT 26.50 R TRUE 6 8.4 3.20 13.6 46.6 145.6 42.5 29.2 14.1 0 69 NA 54 16 7 23 95 35 1.3 0 156 3.9 120 27 7.0 5.2 8.9 6.4 228 464 145 23 28 0.0 182 68 153 162 NA 1.8 221
1 F TT 26.26 R TRUE 6 10.7 2.98 12.7 42.8 143.6 42.6 29.7 14.4 0 80 NA 61 21 6 12 77 43 1.2 0 157 4.2 119 25 6.8 5.1 9.0 6.1 204 530 128 18 25 0.0 176 48 109 142 4 1.7 243
1 F TT 26.02 R TRUE 6 7.9 3.10 13.7 46.3 149.4 44.2 29.6 14.7 1 89 NA 57 19 6 18 95 28 1.3 0 154 3.7 118 26 6.7 5.2 8.9 5.3 214 445 162 22 23 0.1 214 46 167 118 4 2.1 221
1 F TT 25.77 R TRUE 6 6.5 3.17 13.6 46.2 145.7 42.9 29.4 13.3 0 59 NA 65 26 4 5 98 42 1.1 0 153 4.3 119 24 6.1 4.9 8.6 5.8 184 420 154 22 24 0.0 204 73 115 113 1 1.9 268
1 F TT 25.56 R TRUE 6 6.6 3.06 13.0 44.9 146.7 42.5 29.0 15.6 0 68 NA 56 16 6 22 83 34 1.3 0 153 3.6 118 26 6.4 5.2 8.7 5.6 204 348 176 24 24 0.2 194 69 203 109 4 2.2 221
1 F TT 25.52 R TRUE 6 7.2 2.86 12.7 41.8 146.2 44.4 30.4 15.2 0 56 NA 47 16 8 29 98 44 1.1 0 155 4.0 124 24 5.9 5.1 8.4 5.9 206 468 169 25 23 0.0 201 84 158 156 2 1.8 268

Part II. Checking Mid age population, results and conclusions

Question:Calculate the minimal age, maximal age and number of years (period) over which each of the Mid age animals (the ones from table sd02.xlsx) were followed. Also give a summary of these characteristics using the summary() function. You don’t have to make a proper table of the output of the summary() function.”

# calculate minimal, maximal age, and period for each AnimalID2
age_summary <- subset_34 %>%
  group_by(AnimalID2) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Period (years followed)
  )

# the calculated summary for each animal
print(age_summary)
# A tibble: 34 × 4
   AnimalID2 Min_Age Max_Age Period
       <dbl>   <dbl>   <dbl>  <dbl>
 1         1      11      26     15
 2         2      15      26     11
 3         3      15      36     21
 4         6      16      37     21
 5         7      13      34     21
 6         9      14      27     13
 7        11      17      38     21
 8        12      14      30     16
 9        13      14      31     17
10        14      17      38     21
# ℹ 24 more rows
summary(age_summary)
   AnimalID2        Min_Age      Max_Age          Period     
 Min.   : 1.00   Min.   :10   Min.   :24.00   Min.   :11.00  
 1st Qu.:13.25   1st Qu.:12   1st Qu.:31.00   1st Qu.:16.00  
 Median :25.00   Median :14   Median :33.00   Median :21.00  
 Mean   :26.09   Mean   :14   Mean   :32.44   Mean   :18.44  
 3rd Qu.:39.75   3rd Qu.:16   3rd Qu.:34.75   3rd Qu.:21.00  
 Max.   :53.00   Max.   :21   Max.   :40.00   Max.   :21.00  

The AnimalID2 values range from 1 to 53, indicating the unique identifiers for the animals studied.

Min_Age

  • Minimum Age: The youngest starting age in the dataset is 10 years.

  • Maximum Age: The oldest starting age is 21 years.

Max_Age

  • Minimum Max Age: The youngest ending age is 24 years, while the oldest is 40 years.

  • Mean Max Age: The average maximum age is 32.44 years.

Period

  • Minimum Period: The shortest follow-up period is 11 years, while the longest is 21 years.

  • Mean Period: On average, animals were followed for 18.44 years.

  • Median Period: The median follow-up period is 21 years, indicating that most animals were observed for close to two decades.

Question:Formulate objective selection criteria for membership of the Mid age group based on the outcome of your previous analysis. Keep in mind that the goal is: [1] To make observations in early adulthood (adulthood starts between 5 and 12 years, earlier with females than males). [2] To follow these animals at least for a significant number of years into adulthood and “old age”. For wild bottlenose dolphins 25 years is old, but not when they live in captivity. [3] Not to deviate too much from the apparent criteria used by the authors. [4] For simplicity, use the same criteria for males and females.”

Answer: We found that the age of the animals when they first started to be measured was when they were 10 years old and the maximum age was when they were 21 years old. This does not align with the first goal of keeping this range within 5-12 years of age.

  • Our suggestion is to align with the objective selection criteria of early adulthood by selecting 5 years as the minimum eligible age group to be included in the subset. Also keeping the 40 years of age as a maximal age for the Animal’s data to be included in the sample.

Question: “Apply these criteria on all 144 dolphins to create your own Objective mid age selection and register your selection in a separate column. Again, display the first 6 lines of the modified data table. You don’t have to number this table or to add table captions”

#|label: tab-taba
#|tab-cap: "Filtered Dataset: Ages Between 5 and 40"

# Filter the dataset to include ages between 5 and 40
subset_q3 <- all_dolphins_144 %>%
  filter(Age >= 5 & Age <= 40)

# Display the first few rows of the filtered dataset as a table
library(knitr)
cat("### Filtered Dataset: Ages Between 5 and 40\n")
### Filtered Dataset: Ages Between 5 and 40
kable(head(subset_q3), caption = "First Few Rows of Filtered Dataset")
First Few Rows of Filtered Dataset
AnimalID Sex Species Age Reason Fasting LabCode WBC RBC HGB HCT MCV MCH MCHC RBCDist NRBC Platelets MPV SEGS ACNeutrophils Lymphs ACLymphocytes Monocytes ACMonocytes EOS ACEosinophils Glucose BUN Creatinine UricAcid Sodium Potassium Chloride CO2 Protein Albumin Calcium InorgPhos AlkPhos LDH AST ALT GGT Bilirubin Cholesterol Triglyceride Iron CPK SED60 Mg GFR
1 F TT 26.50 R TRUE 6 8.4 3.20 13.6 46.6 145.6 42.5 29.2 14.1 0 69 NA 54 4.536 16 1.344 7 0.588 23 1.932 95 35 1.3 0 156 3.9 120 27 7.0 5.2 8.9 6.4 228 464 145 23 28 0.0 182 68 153 162 NA 1.8 221
1 F TT 26.26 R TRUE 6 10.7 2.98 12.7 42.8 143.6 42.6 29.7 14.4 0 80 NA 61 6.527 21 2.247 6 0.642 12 1.284 77 43 1.2 0 157 4.2 119 25 6.8 5.1 9.0 6.1 204 530 128 18 25 0.0 176 48 109 142 4 1.7 243
1 F TT 26.02 R TRUE 6 7.9 3.10 13.7 46.3 149.4 44.2 29.6 14.7 1 89 NA 57 4.503 19 1.501 6 0.474 18 1.422 95 28 1.3 0 154 3.7 118 26 6.7 5.2 8.9 5.3 214 445 162 22 23 0.1 214 46 167 118 4 2.1 221
1 F TT 25.77 R TRUE 6 6.5 3.17 13.6 46.2 145.7 42.9 29.4 13.3 0 59 NA 65 4.225 26 1.690 4 0.260 5 0.325 98 42 1.1 0 153 4.3 119 24 6.1 4.9 8.6 5.8 184 420 154 22 24 0.0 204 73 115 113 1 1.9 268
1 F TT 25.56 R TRUE 6 6.6 3.06 13.0 44.9 146.7 42.5 29.0 15.6 0 68 NA 56 3.696 16 1.056 6 0.396 22 1.452 83 34 1.3 0 153 3.6 118 26 6.4 5.2 8.7 5.6 204 348 176 24 24 0.2 194 69 203 109 4 2.2 221
1 F TT 25.52 R TRUE 6 7.2 2.86 12.7 41.8 146.2 44.4 30.4 15.2 0 56 NA 47 3.384 16 1.152 8 0.576 29 2.088 98 44 1.1 0 155 4.0 124 24 5.9 5.1 8.4 5.9 206 468 169 25 23 0.0 201 84 158 156 2 1.8 268
# Summary of the Age variable
cat("\n### Summary of Age Variable\n")

### Summary of Age Variable
summary(subset_q3$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   5.02   14.66   22.11   22.30   30.32   39.90 

Question:Summarize similarities and differences between the authors Mid age group and the Objective mid age group in a table”.

AgeGroup_Summary_Subset34 <- subset_34 %>%
  group_by(agegroup) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Period (years followed)
  )

# Display the summary as a clean table
kable(
  AgeGroup_Summary_Subset34,
  caption = "Summary of Age Groups in Subset of 34 Dolphins",
  format = "html"
)
Summary of Age Groups in Subset of 34 Dolphins
agegroup Min_Age Max_Age Period
2 10 13 3
3 14 32 18
4 33 37 4
5 38 40 2
# Display the calculated summary for each animal
#print(AgeGroup_Summary_Subset34)

Since there is no Age Group Column in the sd01 dataset, we proceeded to create a new column within the Objective Dataset (named as “subset_q3”), to group the ages from 1 to 5 as it was similarly the case in the sd02 dataset.

subset_q3 <- subset_q3 %>%
  mutate(
    agegroup = case_when(
      Age >= 5 & Age <= 9.99 ~ 1,
      Age >= 10 & Age <= 13.99 ~ 2,
      Age >= 14 & Age <= 32.99 ~ 3,
      Age >= 33 & Age <= 37.99 ~ 4,
      Age >= 38 & Age <= 40 ~ 5,
      TRUE ~ NA_real_  # Assign NA for ages outside these ranges
    )
  )

# View the updated dataset
head(subset_q3)
# A tibble: 6 × 52
  AnimalID Sex   Species   Age Reason Fasting LabCode   WBC   RBC   HGB   HCT
     <dbl> <chr> <chr>   <dbl> <chr>  <lgl>     <dbl> <dbl> <dbl> <dbl> <dbl>
1        1 F     TT       26.5 R      TRUE          6   8.4  3.2   13.6  46.6
2        1 F     TT       26.3 R      TRUE          6  10.7  2.98  12.7  42.8
3        1 F     TT       26.0 R      TRUE          6   7.9  3.1   13.7  46.3
4        1 F     TT       25.8 R      TRUE          6   6.5  3.17  13.6  46.2
5        1 F     TT       25.6 R      TRUE          6   6.6  3.06  13    44.9
6        1 F     TT       25.5 R      TRUE          6   7.2  2.86  12.7  41.8
# ℹ 41 more variables: MCV <dbl>, MCH <dbl>, MCHC <dbl>, RBCDist <dbl>,
#   NRBC <dbl>, Platelets <dbl>, MPV <dbl>, SEGS <dbl>, ACNeutrophils <dbl>,
#   Lymphs <dbl>, ACLymphocytes <dbl>, Monocytes <dbl>, ACMonocytes <dbl>,
#   EOS <dbl>, ACEosinophils <dbl>, Glucose <dbl>, BUN <dbl>, Creatinine <dbl>,
#   UricAcid <dbl>, Sodium <dbl>, Potassium <dbl>, Chloride <dbl>, CO2 <dbl>,
#   Protein <dbl>, Albumin <dbl>, Calcium <dbl>, InorgPhos <dbl>,
#   AlkPhos <dbl>, LDH <dbl>, AST <dbl>, ALT <dbl>, GGT <dbl>, …
AgeGroup_Summary_Subset_q3 <- subset_q3 %>%
  group_by(agegroup) %>%
  summarise(
    Min_Age = min(Age, na.rm = TRUE),  # Minimal age
    Max_Age = max(Age, na.rm = TRUE),  # Maximal age
    Period = Max_Age - Min_Age         # Period (years followed)
  )
# Display the summary as a clean table
kable(
  AgeGroup_Summary_Subset_q3,
  caption = "Summary of Age Groups in Objective Dataset",
  format = "html"
)
Summary of Age Groups in Objective Dataset
agegroup Min_Age Max_Age Period
1 5.02 9.89 4.87
2 10.00 13.91 3.91
3 14.00 32.91 18.91
4 33.00 37.92 4.92
5 38.02 39.90 1.88
# Display the calculated summary for each animal
#print(AgeGroup_Summary_Subset_q3)

However, in the subset dataset, we realised that there was no age group 1 (fromn age 5 to 9.99), which have led to NA values upon our comparison between these two datasets.

#Join the two dataframe using a outer join
Comparison_table <- merge(x=AgeGroup_Summary_Subset_q3, y=AgeGroup_Summary_Subset34,
                          by="agegroup", all = TRUE)

print(Comparison_table)
  agegroup Min_Age.x Max_Age.x Period.x Min_Age.y Max_Age.y Period.y
1        1      5.02      9.89     4.87        NA        NA       NA
2        2     10.00     13.91     3.91        10        13        3
3        3     14.00     32.91    18.91        14        32       18
4        4     33.00     37.92     4.92        33        37        4
5        5     38.02     39.90     1.88        38        40        2
  • change the x and y values for clarity on the table

Question: “By making a suitable graph of the results of the authors Mid age group (the selection of animals from sd02.xlsx) confirm or decline the conclusion by the authors that females apparently age faster than most males when taking into account hemoglobin content only. Use all data for these dolphins, not just the subset of data listed in sd02.xlsx. Tip: use the ggplot2 functions geom_smooth() to draw straight (!) trend lines and facet_wrap() to make separate plots for males and females.”

library(ggplot2)
#|label: fig-hemoglobin
#|fig-cap: "Hemoglobin Amount: Age by Sex"
library(ggplot2)

# Prepare the plot data
plot_data <- all_dolphins_144 %>%
  filter(!is.na(HGB), !is.na(Age), !is.na(Sex), Age >= 10 & Age <= 40)  # Remove rows with missing values

# Create the plot
ggplot(plot_data, aes(x = Age, y = HGB, color = Sex)) +
  geom_point(alpha = 0.6) +  # Scatter points
  geom_smooth(method = "lm", se = TRUE, color = "black") +  # Black trend lines
  facet_wrap(~Sex) +  # Separate plots for males and females
  labs(
    title = "Hemoglobin Content vs. Age by Sex",
    x = "Age (years)",
    y = "Hemoglobin Content (HGB)"
  ) +
  theme_minimal()

Our analysis reveals that hemoglobin (HGB) content decreases significantly with increasing age in females, exhibiting a steeper decline compared to males. We can therefore confirm that the authors are correct. The trend on the ?@fig-hemoglobin suggests that females may experience a faster reduction in hemoglobin content as they age, aligning with the hypothesis that females age more quickly in terms of hemoglobin levels. In contrast, the decline in hemoglobin content for males appears to be less pronounced, indicating a slower aging process in this biomarker.