Problem 1: In the subset file “sd02”, there are two dedicated columns for conveying “Sex” information of the animals:
One named as “Sex” and given “M” and “F” values.
The other named as “sex2” and given values 0 and 1.
We can use the Sex column (M for Male, F for Female) as the reference and update or create a new column to replace sex2 (0 and 1) with consistent, meaningful labels.
The downloaded binary packages are in
/var/folders/xh/l8syzm3s185d0wjdjzy_lyd00000gn/T//RtmpEfSGUr/downloaded_packages
library(tidyverse)library(datasets)library(readxl)library(readr)library(readxl)library(dplyr)all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")metadata <-read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")subset_34 <- subset_34 %>%mutate(Sex =case_when( Sex =="M"~"Male", # Replace "M" with "Male" Sex =="F"~"Female", # Replace "F" with "Female"TRUE~NA_character_# Handle any unexpected values ),sex2 =case_when( Sex =="Male"~1, # Male is encoded as 1 Sex =="Female"~0, # Female is encoded as 0TRUE~NA_real_# Handle any unexpected values ) )# View the updated datasethead(subset_34)
Problem 2: In both of the datasets, the first 4 rows are used to convey information. These are problematic as they should not be confused with actual data rows during the data analysis processes.
To avoid any mistakes we can either delete these rows from the excel file before starting to the analysis, or skip these rows via the code
all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx")subset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx")metadata <-read.csv("/Users/zilanoz/Desktop/Assignment_1/data/metadata.csv")# Skip the first 4 rows while reading the datasubset_34 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd02.xlsx", skip =4)all_dolphins_144 <- readxl::read_excel("/Users/zilanoz/Desktop/Assignment_1/data/sd01.xlsx", skip =4)
Problem 3: In our dataset, we have two types of variables for each cell type:
Percent values: e.g., SEGS, Lymphs, Monocytes, EOS
These variables are mathematically related, meaning that one set (either the percent values or the absolute counts) is redundant since I can derive one from the other using the total white blood cell count (WBC).
Why is this a problem?
If I have both percent values and absolute counts, there’s a risk that one set of variables might not align with the other, especially if the total WBC count changes or if updates are made inconsistently. For example, if the total WBC count (WBC) is recalculated or adjusted, but the percentages or absolute counts are not updated accordingly, it could result in mismatched or incorrect values. This would undermine the reliability of my analysis.
Although it would be fine with smaller datasets, redundant variables within already large datasets consume unnecessary storage space and may increase processing time.
Having two different ways to represent the same information adds unnecessary complexity to my dataset. During analysis, I would need to decide which variable to use and confirm that it is consistent with the other. For example, if I am comparing cell populations across samples, should I use the absolute counts or the percentages?
To resolve this redundancy:
We candelete one of the redundant variables within the pairs.
We may choose to retain the absolute counts (ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils) because they are more critical for the type of analysis we may intend to perform.
We can recalculate the removed variables when required. I can always calculate percent values (SEGS, Lymphs, Monocytes, EOS) when needed, via following the logic:
Upon examining the dataset (sd01.xlsx), I identified an integrity issue in the column representing subspecies codes. According to the paper, two subspecies of bottlenose dolphins were investigated were Tursiops truncatus truncatus (denoted as TT) and Tursiops truncatus gilli (denoted as TG and TTG). This discrepancy creates ambiguity regarding the classification of subspecies.
The same subspecies may be represented by different codes (TG and TTTG), which violates the principle of consistent data representation, introduces uncertainty, makes it difficult to interpret the dataset accurately or use it for analysis.
In order to resolve this, we can standardise the subspecies codes to avoid ambiguity. For example:
Use TT for Tursiops truncatus truncatus.
Use TTG (or another single code) for Tursiops truncatus gilli
To resolve this issue, I propose standardizing the Species column by merging TTG into TG (or vice versa). My reasoning for this modification is based on:
Evidence from the Paper: The paper explicitly states that only two subspecies were studied.
#to change the levelsall_dolphins_144 <- all_dolphins_144 %>%mutate(Species =ifelse(Species =="TG", "TTG", Species))# to verify the presence of the TT and TTG in the dataunique(all_dolphins_144$Species)
[1] "TT" "TTTG" "TTG"
# to calculate counts and percentages for each speciesspecies_counts <- all_dolphins_144 %>%count(Species) %>%mutate(Percentage = (n /sum(n)) *100)# print the resultprint(species_counts)
# A tibble: 3 × 3
Species n Percentage
<chr> <int> <dbl>
1 TT 5544 94.1
2 TTG 90 1.53
3 TTTG 255 4.33
#checking for NA valuesna_species <-sum(is.na(all_dolphins_144$Species))cat("Number of missing values in 'Species':", na_species, "\n")
Number of missing values in 'Species': 0
However, it seems that the percentage doesn’t match with that of mentioned in the paper, as TT seems to be %94.14 and TTTG (the level that we are unsure of what it means) is %4.33. Considering that their total adds up to the %98.5 mentioned as the percentage of the TT on the article, our assumption of the TTG belonging to TTTG is wrong, because the data is misleading. Instead, TTTG belongs to TT Species.
In order to fix this, we would need to change the “TTTG” levels to “TT” levels.
# A tibble: 2 × 3
Species n Percentage
<chr> <int> <dbl>
1 TT 5799 98.5
2 TTG 90 1.53
Question:Display the top 6 lines of each of your new data frames.You don’t have to number these tables or to add table captions.
library(knitr) # For kablelibrary(DT) # For interactive datatables# creating a dataset for absolute countsall_dolphins_144_absolute <- all_dolphins_144 %>%select(-c(SEGS, Lymphs, Monocytes, EOS))# creating a dataset for percentagesall_dolphins_144_percentage <- all_dolphins_144 %>%select(-c(ACNeutrophils, ACLymphocytes, ACMonocytes, ACEosinophils))# displaying the first 6 rows of the absolute counts dataset as a styled tablecat("### Table 1: Absolute Counts Table\n")
### Table 1: Absolute Counts Table
kable(head(all_dolphins_144_absolute), caption ="First 6 Rows of Absolute Counts Dataset")
First 6 Rows of Absolute Counts Dataset
AnimalID
Sex
Species
Age
Reason
Fasting
LabCode
WBC
RBC
HGB
HCT
MCV
MCH
MCHC
RBCDist
NRBC
Platelets
MPV
ACNeutrophils
ACLymphocytes
ACMonocytes
ACEosinophils
Glucose
BUN
Creatinine
UricAcid
Sodium
Potassium
Chloride
CO2
Protein
Albumin
Calcium
InorgPhos
AlkPhos
LDH
AST
ALT
GGT
Bilirubin
Cholesterol
Triglyceride
Iron
CPK
SED60
Mg
GFR
1
F
TT
26.50
R
TRUE
6
8.4
3.20
13.6
46.6
145.6
42.5
29.2
14.1
0
69
NA
4.536
1.344
0.588
1.932
95
35
1.3
0
156
3.9
120
27
7.0
5.2
8.9
6.4
228
464
145
23
28
0.0
182
68
153
162
NA
1.8
221
1
F
TT
26.26
R
TRUE
6
10.7
2.98
12.7
42.8
143.6
42.6
29.7
14.4
0
80
NA
6.527
2.247
0.642
1.284
77
43
1.2
0
157
4.2
119
25
6.8
5.1
9.0
6.1
204
530
128
18
25
0.0
176
48
109
142
4
1.7
243
1
F
TT
26.02
R
TRUE
6
7.9
3.10
13.7
46.3
149.4
44.2
29.6
14.7
1
89
NA
4.503
1.501
0.474
1.422
95
28
1.3
0
154
3.7
118
26
6.7
5.2
8.9
5.3
214
445
162
22
23
0.1
214
46
167
118
4
2.1
221
1
F
TT
25.77
R
TRUE
6
6.5
3.17
13.6
46.2
145.7
42.9
29.4
13.3
0
59
NA
4.225
1.690
0.260
0.325
98
42
1.1
0
153
4.3
119
24
6.1
4.9
8.6
5.8
184
420
154
22
24
0.0
204
73
115
113
1
1.9
268
1
F
TT
25.56
R
TRUE
6
6.6
3.06
13.0
44.9
146.7
42.5
29.0
15.6
0
68
NA
3.696
1.056
0.396
1.452
83
34
1.3
0
153
3.6
118
26
6.4
5.2
8.7
5.6
204
348
176
24
24
0.2
194
69
203
109
4
2.2
221
1
F
TT
25.52
R
TRUE
6
7.2
2.86
12.7
41.8
146.2
44.4
30.4
15.2
0
56
NA
3.384
1.152
0.576
2.088
98
44
1.1
0
155
4.0
124
24
5.9
5.1
8.4
5.9
206
468
169
25
23
0.0
201
84
158
156
2
1.8
268
# displaying the first 6 rows of the percentages dataset as a styled tablecat("\n### Table 2: Percentages Table\n")
### Table 2: Percentages Table
kable(head(all_dolphins_144_percentage), caption ="First 6 Rows of Percentages Dataset")
First 6 Rows of Percentages Dataset
AnimalID
Sex
Species
Age
Reason
Fasting
LabCode
WBC
RBC
HGB
HCT
MCV
MCH
MCHC
RBCDist
NRBC
Platelets
MPV
SEGS
Lymphs
Monocytes
EOS
Glucose
BUN
Creatinine
UricAcid
Sodium
Potassium
Chloride
CO2
Protein
Albumin
Calcium
InorgPhos
AlkPhos
LDH
AST
ALT
GGT
Bilirubin
Cholesterol
Triglyceride
Iron
CPK
SED60
Mg
GFR
1
F
TT
26.50
R
TRUE
6
8.4
3.20
13.6
46.6
145.6
42.5
29.2
14.1
0
69
NA
54
16
7
23
95
35
1.3
0
156
3.9
120
27
7.0
5.2
8.9
6.4
228
464
145
23
28
0.0
182
68
153
162
NA
1.8
221
1
F
TT
26.26
R
TRUE
6
10.7
2.98
12.7
42.8
143.6
42.6
29.7
14.4
0
80
NA
61
21
6
12
77
43
1.2
0
157
4.2
119
25
6.8
5.1
9.0
6.1
204
530
128
18
25
0.0
176
48
109
142
4
1.7
243
1
F
TT
26.02
R
TRUE
6
7.9
3.10
13.7
46.3
149.4
44.2
29.6
14.7
1
89
NA
57
19
6
18
95
28
1.3
0
154
3.7
118
26
6.7
5.2
8.9
5.3
214
445
162
22
23
0.1
214
46
167
118
4
2.1
221
1
F
TT
25.77
R
TRUE
6
6.5
3.17
13.6
46.2
145.7
42.9
29.4
13.3
0
59
NA
65
26
4
5
98
42
1.1
0
153
4.3
119
24
6.1
4.9
8.6
5.8
184
420
154
22
24
0.0
204
73
115
113
1
1.9
268
1
F
TT
25.56
R
TRUE
6
6.6
3.06
13.0
44.9
146.7
42.5
29.0
15.6
0
68
NA
56
16
6
22
83
34
1.3
0
153
3.6
118
26
6.4
5.2
8.7
5.6
204
348
176
24
24
0.2
194
69
203
109
4
2.2
221
1
F
TT
25.52
R
TRUE
6
7.2
2.86
12.7
41.8
146.2
44.4
30.4
15.2
0
56
NA
47
16
8
29
98
44
1.1
0
155
4.0
124
24
5.9
5.1
8.4
5.9
206
468
169
25
23
0.0
201
84
158
156
2
1.8
268
Part II. Checking Mid age population, results and conclusions
Question: “Calculate the minimal age, maximal age and number of years (period) over which each of the Mid age animals (the ones from table sd02.xlsx) were followed. Also give a summary of these characteristics using the summary() function. You don’t have to make a proper table of the output of the summary() function.”
# calculate minimal, maximal age, and period for each AnimalID2age_summary <- subset_34 %>%group_by(AnimalID2) %>%summarise(Min_Age =min(Age, na.rm =TRUE), # Minimal ageMax_Age =max(Age, na.rm =TRUE), # Maximal agePeriod = Max_Age - Min_Age # Period (years followed) )# the calculated summary for each animalprint(age_summary)
AnimalID2 Min_Age Max_Age Period
Min. : 1.00 Min. :10 Min. :24.00 Min. :11.00
1st Qu.:13.25 1st Qu.:12 1st Qu.:31.00 1st Qu.:16.00
Median :25.00 Median :14 Median :33.00 Median :21.00
Mean :26.09 Mean :14 Mean :32.44 Mean :18.44
3rd Qu.:39.75 3rd Qu.:16 3rd Qu.:34.75 3rd Qu.:21.00
Max. :53.00 Max. :21 Max. :40.00 Max. :21.00
The AnimalID2 values range from 1 to 53, indicating the unique identifiers for the animals studied.
Min_Age
Minimum Age: The youngest starting age in the dataset is 10 years.
Maximum Age: The oldest starting age is 21 years.
Max_Age
Minimum Max Age: The youngest ending age is 24 years, while the oldest is 40 years.
Mean Max Age: The average maximum age is 32.44 years.
Period
Minimum Period: The shortest follow-up period is 11 years, while the longest is 21 years.
Mean Period: On average, animals were followed for 18.44 years.
Median Period: The median follow-up period is 21 years, indicating that most animals were observed for close to two decades.
Question: “Formulate objective selection criteria for membership of the Mid age group based on the outcome of your previous analysis. Keep in mind that the goal is: [1] To make observations in early adulthood (adulthood starts between 5 and 12 years, earlier with females than males). [2] To follow these animals at least for a significant number of years into adulthood and “old age”. For wild bottlenose dolphins 25 years is old, but not when they live in captivity. [3] Not to deviate too much from the apparent criteria used by the authors. [4] For simplicity, use the same criteria for males and females.”
Answer: We found that the age of the animals when they first started to be measured was when they were 10 years old and the maximum age was when they were 21 years old. This does not align with the first goal of keeping this range within 5-12 years of age.
Our suggestion is to align with the objective selection criteria of early adulthood by selecting 5 years as the minimum eligible age group to be included in the subset. Also keeping the 40 years of age as a maximal age for the Animal’s data to be included in the sample.
Question: “Apply these criteria on all 144 dolphins to create your own Objective mid age selection and register your selection in a separate column. Again, display the first 6 lines of the modified data table. You don’t have to number this table or to add table captions”
#|label: tab-taba#|tab-cap: "Filtered Dataset: Ages Between 5 and 40"# Filter the dataset to include ages between 5 and 40subset_q3 <- all_dolphins_144 %>%filter(Age >=5& Age <=40)# Display the first few rows of the filtered dataset as a tablelibrary(knitr)cat("### Filtered Dataset: Ages Between 5 and 40\n")
### Filtered Dataset: Ages Between 5 and 40
kable(head(subset_q3), caption ="First Few Rows of Filtered Dataset")
First Few Rows of Filtered Dataset
AnimalID
Sex
Species
Age
Reason
Fasting
LabCode
WBC
RBC
HGB
HCT
MCV
MCH
MCHC
RBCDist
NRBC
Platelets
MPV
SEGS
ACNeutrophils
Lymphs
ACLymphocytes
Monocytes
ACMonocytes
EOS
ACEosinophils
Glucose
BUN
Creatinine
UricAcid
Sodium
Potassium
Chloride
CO2
Protein
Albumin
Calcium
InorgPhos
AlkPhos
LDH
AST
ALT
GGT
Bilirubin
Cholesterol
Triglyceride
Iron
CPK
SED60
Mg
GFR
1
F
TT
26.50
R
TRUE
6
8.4
3.20
13.6
46.6
145.6
42.5
29.2
14.1
0
69
NA
54
4.536
16
1.344
7
0.588
23
1.932
95
35
1.3
0
156
3.9
120
27
7.0
5.2
8.9
6.4
228
464
145
23
28
0.0
182
68
153
162
NA
1.8
221
1
F
TT
26.26
R
TRUE
6
10.7
2.98
12.7
42.8
143.6
42.6
29.7
14.4
0
80
NA
61
6.527
21
2.247
6
0.642
12
1.284
77
43
1.2
0
157
4.2
119
25
6.8
5.1
9.0
6.1
204
530
128
18
25
0.0
176
48
109
142
4
1.7
243
1
F
TT
26.02
R
TRUE
6
7.9
3.10
13.7
46.3
149.4
44.2
29.6
14.7
1
89
NA
57
4.503
19
1.501
6
0.474
18
1.422
95
28
1.3
0
154
3.7
118
26
6.7
5.2
8.9
5.3
214
445
162
22
23
0.1
214
46
167
118
4
2.1
221
1
F
TT
25.77
R
TRUE
6
6.5
3.17
13.6
46.2
145.7
42.9
29.4
13.3
0
59
NA
65
4.225
26
1.690
4
0.260
5
0.325
98
42
1.1
0
153
4.3
119
24
6.1
4.9
8.6
5.8
184
420
154
22
24
0.0
204
73
115
113
1
1.9
268
1
F
TT
25.56
R
TRUE
6
6.6
3.06
13.0
44.9
146.7
42.5
29.0
15.6
0
68
NA
56
3.696
16
1.056
6
0.396
22
1.452
83
34
1.3
0
153
3.6
118
26
6.4
5.2
8.7
5.6
204
348
176
24
24
0.2
194
69
203
109
4
2.2
221
1
F
TT
25.52
R
TRUE
6
7.2
2.86
12.7
41.8
146.2
44.4
30.4
15.2
0
56
NA
47
3.384
16
1.152
8
0.576
29
2.088
98
44
1.1
0
155
4.0
124
24
5.9
5.1
8.4
5.9
206
468
169
25
23
0.0
201
84
158
156
2
1.8
268
# Summary of the Age variablecat("\n### Summary of Age Variable\n")
### Summary of Age Variable
summary(subset_q3$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.02 14.66 22.11 22.30 30.32 39.90
Question: “Summarize similarities and differences between the authors Mid age group and the Objective mid age group in a table”.
AgeGroup_Summary_Subset34 <- subset_34 %>%group_by(agegroup) %>%summarise(Min_Age =min(Age, na.rm =TRUE), # Minimal ageMax_Age =max(Age, na.rm =TRUE), # Maximal agePeriod = Max_Age - Min_Age # Period (years followed) )# Display the summary as a clean tablekable( AgeGroup_Summary_Subset34,caption ="Summary of Age Groups in Subset of 34 Dolphins",format ="html")
Summary of Age Groups in Subset of 34 Dolphins
agegroup
Min_Age
Max_Age
Period
2
10
13
3
3
14
32
18
4
33
37
4
5
38
40
2
# Display the calculated summary for each animal#print(AgeGroup_Summary_Subset34)
Since there is no Age Group Column in the sd01 dataset, we proceeded to create a new column within the Objective Dataset (named as “subset_q3”), to group the ages from 1 to 5 as it was similarly the case in the sd02 dataset.
subset_q3 <- subset_q3 %>%mutate(agegroup =case_when( Age >=5& Age <=9.99~1, Age >=10& Age <=13.99~2, Age >=14& Age <=32.99~3, Age >=33& Age <=37.99~4, Age >=38& Age <=40~5,TRUE~NA_real_# Assign NA for ages outside these ranges ) )# View the updated datasethead(subset_q3)
# A tibble: 6 × 52
AnimalID Sex Species Age Reason Fasting LabCode WBC RBC HGB HCT
<dbl> <chr> <chr> <dbl> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 F TT 26.5 R TRUE 6 8.4 3.2 13.6 46.6
2 1 F TT 26.3 R TRUE 6 10.7 2.98 12.7 42.8
3 1 F TT 26.0 R TRUE 6 7.9 3.1 13.7 46.3
4 1 F TT 25.8 R TRUE 6 6.5 3.17 13.6 46.2
5 1 F TT 25.6 R TRUE 6 6.6 3.06 13 44.9
6 1 F TT 25.5 R TRUE 6 7.2 2.86 12.7 41.8
# ℹ 41 more variables: MCV <dbl>, MCH <dbl>, MCHC <dbl>, RBCDist <dbl>,
# NRBC <dbl>, Platelets <dbl>, MPV <dbl>, SEGS <dbl>, ACNeutrophils <dbl>,
# Lymphs <dbl>, ACLymphocytes <dbl>, Monocytes <dbl>, ACMonocytes <dbl>,
# EOS <dbl>, ACEosinophils <dbl>, Glucose <dbl>, BUN <dbl>, Creatinine <dbl>,
# UricAcid <dbl>, Sodium <dbl>, Potassium <dbl>, Chloride <dbl>, CO2 <dbl>,
# Protein <dbl>, Albumin <dbl>, Calcium <dbl>, InorgPhos <dbl>,
# AlkPhos <dbl>, LDH <dbl>, AST <dbl>, ALT <dbl>, GGT <dbl>, …
AgeGroup_Summary_Subset_q3 <- subset_q3 %>%group_by(agegroup) %>%summarise(Min_Age =min(Age, na.rm =TRUE), # Minimal ageMax_Age =max(Age, na.rm =TRUE), # Maximal agePeriod = Max_Age - Min_Age # Period (years followed) )# Display the summary as a clean tablekable( AgeGroup_Summary_Subset_q3,caption ="Summary of Age Groups in Objective Dataset",format ="html")
Summary of Age Groups in Objective Dataset
agegroup
Min_Age
Max_Age
Period
1
5.02
9.89
4.87
2
10.00
13.91
3.91
3
14.00
32.91
18.91
4
33.00
37.92
4.92
5
38.02
39.90
1.88
# Display the calculated summary for each animal#print(AgeGroup_Summary_Subset_q3)
However, in the subset dataset, we realised that there was no age group 1 (fromn age 5 to 9.99), which have led to NA values upon our comparison between these two datasets.
#Join the two dataframe using a outer joinComparison_table <-merge(x=AgeGroup_Summary_Subset_q3, y=AgeGroup_Summary_Subset34,by="agegroup", all =TRUE)print(Comparison_table)
change the x and y values for clarity on the table
Question: “By making a suitable graph of the results of the authors Mid age group (the selection of animals from sd02.xlsx) confirm or decline the conclusion by the authors that females apparently age faster than most males when taking into account hemoglobin content only. Use all data for these dolphins, not just the subset of data listed in sd02.xlsx. Tip: use the ggplot2 functions geom_smooth() to draw straight (!) trend lines and facet_wrap() to make separate plots for males and females.”
library(ggplot2)#|label: fig-hemoglobin#|fig-cap: "Hemoglobin Amount: Age by Sex"library(ggplot2)# Prepare the plot dataplot_data <- all_dolphins_144 %>%filter(!is.na(HGB), !is.na(Age), !is.na(Sex), Age >=10& Age <=40) # Remove rows with missing values# Create the plotggplot(plot_data, aes(x = Age, y = HGB, color = Sex)) +geom_point(alpha =0.6) +# Scatter pointsgeom_smooth(method ="lm", se =TRUE, color ="black") +# Black trend linesfacet_wrap(~Sex) +# Separate plots for males and femaleslabs(title ="Hemoglobin Content vs. Age by Sex",x ="Age (years)",y ="Hemoglobin Content (HGB)" ) +theme_minimal()
Our analysis reveals that hemoglobin (HGB) content decreases significantly with increasing age in females, exhibiting a steeper decline compared to males. We can therefore confirm that the authors are correct. The trend on the ?@fig-hemoglobin suggests that females may experience a faster reduction in hemoglobin content as they age, aligning with the hypothesis that females age more quickly in terms of hemoglobin levels. In contrast, the decline in hemoglobin content for males appears to be less pronounced, indicating a slower aging process in this biomarker.