I'll try to interweave results - tabular and graphical - for the things you've highlighted in the revised outline (dated 28 May 2014).
First thing we'll look at are the numbers from the two comparisons we make for resting and for available females. This comparison is to look at the health of successful vs. unsuccessful females. I'll show the plots, the statistics, and the tallies for the different groups.
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
## here
##
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, desc, failwith, id, mutate, summarise, summarize
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
To make the comparisons, we need to loop over years, and then find out two groups of individuals: 1) those available in a given year and pregnant the next year; and 2) those available in a given year, but still available in the next year.
Those data look like this:
head(pchealth)
## year navail npregNext nhealth phealth
## 1 1985 12 4 75.30 70.88
## 2 1986 7 2 75.45 82.96
## 3 1987 13 7 78.65 74.77
## 4 1988 11 7 81.15 82.63
## 5 1989 6 4 74.18 79.94
## 6 1990 16 7 73.58 79.34
In the pchealth data frame, the variables are as follows:
navail is the sum of 3 to 2, and 3 to 4 animals, i.e. resting to pregnant, resting to available, respectively.npregNext is the sum of the 3 to 2 animals.phealth is the health of the animals who became pregnantnphealth is the health of the animals who DID NOT became pregnantWe can plot the numbers easily to see how many are in each group:
## pdf
## 2
I'll write these out to a csv file for further comparison:
write.csv(pchealth, file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/data/Summary_RestingPregnantComparison.csv")
But we also want to see if these are a) different. To do that, we should work with the raw data, not the mean summaries. So we need to grab those. I'll do this by looping over year as before, but instead of grabbing the mean, I'll just grab the whole year's health values:
This yields a data frame where each row has one individual's health for a given year, along with its status. Here's what the first few lines of the data look like:
## EGNo Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1118 1118 74.48 72.93 71.76 70.96 70.53 70.34 70.39 70.74 71.36 72.25
## 1127 1127 68.13 68.80 69.70 70.48 71.11 71.66 72.14 72.51 72.82 73.07
## 1135 1135 65.68 65.48 65.30 65.66 66.01 66.41 66.91 67.54 68.35 69.32
## 1157 1157 68.20 68.79 69.59 70.69 71.65 72.48 73.19 73.84 74.47 75.06
## 1142 1142 69.42 70.91 72.63 74.71 76.49 78.05 79.32 80.31 81.13 81.66
## 1145 1145 69.99 70.58 71.42 72.19 72.85 73.45 74.02 74.46 74.85 75.16
## Nov Dec year status
## 1118 73.06 73.85 1985 32
## 1127 73.20 73.33 1985 32
## 1135 70.14 70.81 1985 32
## 1157 75.66 76.22 1985 32
## 1142 82.05 82.29 1985 34
## 1145 75.49 75.76 1985 34
In the above, status of 32 means they were resting one year, and pregnant the next. Status of 34 means resting one year, but available the next.
Let's plot out the densities of these to check for normality:
## pdf
## 2
They seem to be left skewed to me. But here's the t.test comparing their means (n.b. I looked in R's help, and these 95% CIs are for the sample estimates as we presumed yesterday) are. You can see that the ones that transition to pregnant are healthier:
t.test(as.vector(pchealth[pchealth$status == 32, 2:13]), as.vector(pchealth[pchealth$status ==
34, 2:13]))
##
## Welch Two Sample t-test
##
## data: as.vector(pchealth[pchealth$status == 32, 2:13]) and as.vector(pchealth[pchealth$status == 34, 2:13])
## t = 6.231, df = 3434, p-value = 5.213e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.261 2.418
## sample estimates:
## mean of x mean of y
## 73.12 71.28
I'll write out the file for you to have the raw data.
write.csv(pchealth, file = "RestingPregnantComparisonRawData.csv")
Now let's look at the available females in the same way.
## year navail npregNext nhealth phealth
## 1 1985 10 2 56.61 79.23
## 2 1986 13 3 77.94 79.30
## 3 1987 14 6 78.83 79.94
## 4 1988 14 3 76.53 75.41
## 5 1989 14 9 69.55 80.93
## 6 1990 7 2 58.21 78.00
In the avchealth data frame, the variables are as follows:
navail is the sum of 4 to 2, and 4 to 4 animals, i.e. available to pregnant, available to available, respectively.npregNext is the sum of the 4 to 2 animals.phealth is the health of the animals who became pregnantnphealth is the health of the animals who DID NOT became pregnantI'll write out the file for you:
write.csv(avchealth, file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/data/Summary_AvailablePregnantComparison.csv")
And then we can plot the numbers in each category by year.
## pdf
## 2
And plot the yearly mean scores of health:
## pdf
## 2
But we want to look at the raw data as well so we'll repeat that loop storing the actual health, not the means.
So let's try to improve upon that previous line plot of the summarised data (by year) with some paired boxplots of the raw data. Personally I don't find this effective, but here it is:
dfLong <- melt(avchealth, id.vars = c("year", "status"), measure.vars = 2:13)
p <- ggplot(dfLong, aes(factor(year), value)) + geom_boxplot(aes(fill = factor(status),
group = paste(factor(status), year))) + labs(y = "Estimated Health", x = "Year",
fill = "Reproductive Status") + scale_fill_grey(start = 1, end = 0.65, labels = c("Av - Preg",
"Av - Av")) + theme_bw()
p
pdf(file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/images/HealthSummary_AvailAvailPregnantPairedBoxPlots.pdf")
p
dev.off()
## pdf
## 2
And on that dataset we'll run the t-test:
t.test(as.vector(avchealth[avchealth$status == 42, 2:13]), as.vector(avchealth[avchealth$status ==
44, 2:13]))
##
## Welch Two Sample t-test
##
## data: as.vector(avchealth[avchealth$status == 42, 2:13]) and as.vector(avchealth[avchealth$status == 44, 2:13])
## t = 12.92, df = 5538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.868 5.252
## sample estimates:
## mean of x mean of y
## 74.55 69.99
As before, there are significant differences in the ones who transition to pregnant; they are healthier than those who transition to available.
Here are the density plots of these data:
## pdf
## 2
These too are left skewed, and I'll write them out to you for further comparison.
write.csv(avchealth, file = "AvailablePregnantHealthComparisonsRawData.csv")
We saw a dip in health of available animals towards the beginning of 2007. We'd like to see what's the cause of that. First let's grab the list of those animals, and then plot out their individual health.
So you can see from the plot that there are three animals that are estimated to be dead. These are:
badHealth <- unique(which(df[, 1:12] < 5, arr.ind = T)[, 1])
df[badHealth, "ID"]
## [1] 1248 2420 1013
## 30 Levels: 1012 1013 1140 1142 1145 1158 1175 1204 1233 1240 1241 ... 3010
Here let's plot out a scatter plot of health of those available in the current year vs. the ratio of # of calves next year / # available females this year. The filled circles is the health of those who got pregnant; the empty circles are those who didn't:
## pdf
## 2
So the goal here is to just look at the distribution of health in the year two years prior to having a calf. n.b. that we're including the first calf here, which is different from the previous examinations of available.
First thing we need to do is grab the data and mold it into a useful form. The logic here is that we'll take the master data frame summary containing all individual health profiles, find the times that are the two years prior, and then set everything else to NA.
p.idx <- which(calves == 1, arr.ind = TRUE) # this is the pregnancy year
aidx <- p.idx
aidx[, 2] <- p.idx[, 2] - 24 # this is the pregnancy year minus two years
aidx <- aidx[which(aidx[, 2] > 0), ]
pop.idx <- aidx
pophealth <- sumh/ng
psub <- matrix(NA, dim(pophealth)[1], dim(pophealth)[2])
psub[pop.idx] <- pophealth[pop.idx]
psub[psub < 5] <- NA # This is to remove health of females that are before firstSight
quantile(psub, c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE)
## 5% 25% 50% 75% 95%
## 54.46 71.27 77.89 82.52 87.42
And then we can visualise the histogram and density plot:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## pdf
## 2
Ok, let's look at the population level health scores. First, I need to factor out the young juveniles, and then marry up the juvenile times with the adult times for a consecutive health profile.
## SightingEGNo min_date max_date MinDate MaxDate MinDateInt MaxDateInt
## 1 1005 1993-08-27 1997-08-23 8-1993 8-1997 284 332
## 2 1017 1988-04-03 2011-03-17 4-1988 3-2011 220 495
## 3 1019 1988-05-05 2009-07-18 5-1988 7-2009 221 475
## 4 1021 1988-04-03 1998-08-04 4-1988 8-1998 220 344
## 5 1026 1989-08-28 2009-09-03 8-1989 9-2009 176 477
## 6 1028 1988-09-02 2008-10-12 9-1988 10-2008 225 466
Then let's get the population health data set up using these individual and time indices
And then plotted.
We also wanted to try this plot with lines for the variance instead of ribbons. That looks like this:
## pdf
## 2
I think this looks better. I like the ribbon, but if we include the date annotation rectangles, then these are better. I'll continue this style below.
Ok, then what are the summary stats of these data, and what do they look like? Fivenum summary contains: minimum, lower-hinge, median, upper-hinge, maximum.
## [1] 0.00 68.34 78.60 84.09 88.81
Note that the 0's are likely from animals that will be estimated as dead. We can talk about pulling out the health values of the animals that the model has estimated to be dead.
Finally, let's add the calving #'s on top of the health plot to round out the picture a little bit.
## pdf
## 2
Ok, that was the population level health. What about some of the sub-population categories? We can turn to these now. I'll leave some of the code out since it's just a repeat of what we did to generate the population level values. We need summaries for:
## pdf
## 2
## pdf
## 2
## pdf
## 2
And now let's look at the different reproductive Female categories. For the pregnant females, I can use the existing time query, but I'll alter it for lactating, resting, and available per our discussions over email. Here are the pregnant females:
## pdf
## 2
Now let's adjust the query for the other times in the reproductive life cycle.
## pdf
## 2
## pdf
## 2
## pdf
## 2
## Warning: row names were found from a short variable and have been
## discarded
The correlation between health of all available females in one year and calves/year (in the following year) is:
cvec <- table(calfTable$Calving.Year)
cvec <- cvec[which(names(cvec) == 1986):(length(cvec) - 1)]
health <- dfavfem[dfavfem$category == "avfemHealth", -c(1, ncol(dfavfem))]
cvs <- data.frame(calfyr = names(cvec), numcalf = cvec)
hvs <- data.frame(hyear = names(health), hvals = t(health))
df <- cbind(hvs[1:24, ], cvs)
head(df)
## hyear X1 calfyr numcalf
## 1984 1984 77.35 1986 13
## 1985 1985 77.48 1987 11
## 1986 1986 78.25 1988 7
## 1987 1987 78.65 1989 16
## 1988 1988 80.19 1990 12
## 1989 1989 78.76 1991 17
cor(df$X1, df$numcalf, use = "complete.obs")
## [1] -0.1004
Note that I've done this quickly, and I'm somewhat tired. So I'll need to go back and check the logic. But it looks right so far. It's weak, but I still like scott's line of thinking about making the link (at least in discussion form) about this and pop consequences.
So now we want to examine the median health of animals in different sub groups during 3 time periods:
Let's set these up as indices we can recycle:
idx92 <- which(monYr[, 2] == 1992 | monYr[, 2] == 1993 | monYr[, 2] == 1994)
idx97 <- which(monYr[, 2] == 1997 | monYr[, 2] == 1998 | monYr[, 2] == 1999)
idx04 <- which(monYr[, 2] == 2004 | monYr[, 2] == 2005 | monYr[, 2] == 2006)
First let's look at the population level health, and show it with density plots of the yearly average
## pdf
## 2
Then we'll look at Adult Males, juveniles, pregnant, lactating, resting, & available females.
## pdf
## 2
## pdf
## 2
## pdf
## 2
## pdf
## 2
## pdf
## 2
## pdf
## 2
## pdf
## 2
Ok, we need to make the data to write to excel. What we want is for each row to correspond to a population or sub-population category, and then each column to have the yearly summary of health (with CIs). I'll copy this file, along with the others, to a “data” folder in the main dropbox paper folder.
dfAll <- rbind(dfPop, dfadMales, dfyjuves, dfojuves, dfpregfem, dflacfem, dfrestfem,
dfavfem)
## Error: numbers of columns of arguments do not match
write.csv(dfAll, file = "YearlyHealthSummariesByGroup.csv")
## Error: object 'dfAll' not found
Ok, this is just a one-off test of doing a comparison of the 'before' and 'after' test for one individual. In this case the befores will be either resting or available years, and the after will be pregnant years. So 1245 has had 4 calves in these years:
calfTable[calfTable$EGNo == 1245, ]
## EGNo Calving.Year CalfNo CalvingInterval
## 167 1245 1996 2645 NA
## 168 1245 2001 3145 5
## 169 1245 2005 3545 4
## 170 1245 2008 3845 3
idx <- which(ID == 1245)
df <- data.frame(status = newCalves[idx, ], dates = as.Date(paste(monYr[, 1],
1, monYr[, 2], sep = "/"), format = "%m/%d/%Y"))
df <- df[which(df$status != 0), ]
p <- ggplot(df, aes(dates, status, group = status, colour = factor(status))) +
geom_point() + labs(title = paste("EGNo: ", ID[idx], sep = ""), colour = "status") +
scale_colour_discrete(labels = c("Calving", "Pregnant", "Resting", "Available"))
p
Ok, let's set up the vectors to get her health in these periods. What I will define as the successful transitions are either resting to pregnant (3 to 2) or available to pregnant (4 to 2). The unsuccessful transitions will be resting to available (3 to 4) or available to available (4 to 4).
Ok, now that we have the indices set up, let's get the health enumerated.
Phew. Now that that is behind us, let's examine health in good versus bad years for EGNo 1245:
t.test(success, failure)
##
## Welch Two Sample t-test
##
## data: success and failure
## t = 3.728, df = 45.73, p-value = 0.0005294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.386 24.725
## sample estimates:
## mean of x mean of y
## 77.11 61.05
Since the above proved successful, let's expand it to look at all individual females.
femID <- unique(sights[which(sights$Age == "A" & sights$GenderCode == "F"),
"SightingEGNo"])
outvals <- numeric(0)
for (j in 1:length(femID)) {
idx <- femID[j]
ridx <- which(row.names(newCalves) == idx)
df <- data.frame(year = monYr[, 2], status = newCalves[ridx, ], change = newCalves[ridx,
] - lead(newCalves[ridx, ]))
rpidx <- monYr[which(df$status == 3 & df$change == 1), 2]
apidx <- monYr[which(df$status == 4 & df$change == 2), 2]
raidx <- monYr[which(df$status == 3 & df$change == -1), 2]
avavidx <- numeric(0)
yrseq <- 1980:2009
for (i in 1:length(yrseq)) {
yidx <- which(monYr[, 2] == yrseq[i])
nyidx <- which(monYr[, 2] == yrseq[i + 1])
cidx <- which(newCalves[ridx, yidx] == 4 & newCalves[ridx, nyidx] ==
4, arr.ind = TRUE)
if (length(cidx) == 0) {
next
} else {
avavidx <- c(avavidx, yrseq[i])
}
}
if (length(c(rpidx, apidx)) < 1 | length(c(raidx, avavidx)) < 1) {
next
}
pophealth <- sumh/ng
pophealth[which(pophealth < 5)] <- NA
if (length(rpidx) > 0) {
rphealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] ==
rpidx)], status = 32)
}
if (length(apidx) > 0) {
aphealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] ==
apidx)], status = 42)
}
if (length(rpidx) > 0 & length(apidx) > 0) {
success <- c(rphealth$health, aphealth$health)
}
if (length(rpidx) > 0 & length(apidx) == 0) {
success <- c(rphealth$health)
}
if (length(rpidx) == 0 & length(apidx) > 0) {
success <- c(aphealth$health)
}
if (length(raidx) > 0) {
rahealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] ==
raidx)], status = 34)
}
if (length(avavidx) > 0) {
aahealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] ==
avavidx)], status = 44)
}
if (length(avavidx) > 0 & length(raidx) > 0) {
failure <- c(rahealth$health, aahealth$health)
}
if (length(avavidx) > 0 & length(raidx) == 0) {
failure <- c(aahealth$health)
}
if (length(avavidx) == 0 & length(raidx) > 0) {
failure <- c(rahealth$health)
}
res <- t.test(success, failure)
dfout <- c(res$statistic, res$parameter, res$p.value, res$estimate, res$conf.int[1],
res$conf.int[2], femID[j])
outvals <- rbind(outvals, dfout)
}
colnames(outvals) <- c("t-statistic", "DegFreedom", "p.value", "X - Mean", "Y - Mean",
"CI - Success", "CI - Failure", "EGNo")
write.csv(outvals, file = "pairedHealthComparison.csv")
Goal here is to plot out the average monthly health over all years to see if we can generate some sort of climatology type picture. Let's prep the data first
pop.idx <- popAll.idx
pophealth <- sumh/ng
for (i in 1:nrow(pop.idx)) {
pophealth[rownames(pophealth) == pop.idx$SightingEGNo[i], 1:(pop.idx$MinDateInt[i] -
1)] <- NA # before first sight
pophealth[rownames(pophealth) == pop.idx$SightingEGNo[i], (pop.idx$MaxDateInt[i] +
1):ncol(pophealth)] <- NA # after last time
}
pophealth[which(pophealth < 5)] <- NA
mhealth <- pophealth[gender == "M", ]
femhealth <- pophealth[gender == "F", ]
mvec <- rep(NA, 12)
femquant <- mquant <- matrix(NA, nrow = 3, ncol = 12)
femvec <- rep(NA, 12)
yrs <- unique(monYr[, 2])
for (i in 1:12) {
cols <- which(monYr[, 1] == i)
mvec[i] <- median(mhealth[, cols], na.rm = TRUE)
femvec[i] <- median(femhealth[, cols], na.rm = TRUE)
femquant[, i] <- quantile(femhealth[, cols], c(0.05, 0.5, 0.95), na.rm = TRUE)
mquant[, i] <- quantile(mhealth[, cols], c(0.05, 0.5, 0.95), na.rm = TRUE)
}
And then visualise that
plot(femvec, type = "b", col = "grey60", ylim = c(78, 78.5))
lines(mvec, col = "grey20")
text(6, 78.35, "Males")
text(6, 78.1, "Females")
pdf(file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/images/HealthClimatology.pdf")
plot(femvec, type = "b", col = "grey60", ylim = c(78, 78.5))
lines(mvec, col = "grey20")
text(6, 78.35, "Males")
text(6, 78.1, "Females")
dev.off()
## pdf
## 2