Results for Health paper

I'll try to interweave results - tabular and graphical - for the things you've highlighted in the revised outline (dated 28 May 2014).

Resting Females: Successful vs. Unsuccessful Comparisons

First thing we'll look at are the numbers from the two comparisons we make for resting and for available females. This comparison is to look at the health of successful vs. unsuccessful females. I'll show the plots, the statistics, and the tallies for the different groups.

## 
## Attaching package: 'plyr'
## 
## The following object is masked from 'package:lubridate':
## 
##     here
## 
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, desc, failwith, id, mutate, summarise, summarize
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

To make the comparisons, we need to loop over years, and then find out two groups of individuals: 1) those available in a given year and pregnant the next year; and 2) those available in a given year, but still available in the next year.

Those data look like this:

head(pchealth)

##   year navail npregNext nhealth phealth
## 1 1985     12         4   75.30   70.88
## 2 1986      7         2   75.45   82.96
## 3 1987     13         7   78.65   74.77
## 4 1988     11         7   81.15   82.63
## 5 1989      6         4   74.18   79.94
## 6 1990     16         7   73.58   79.34

In the pchealth data frame, the variables are as follows:

navail is the sum of 3 to 2, and 3 to 4 animals, i.e. resting to pregnant, resting to available, respectively.
npregNext is the sum of the 3 to 2 animals.
phealth is the health of the animals who became pregnant
nphealth is the health of the animals who DID NOT became pregnant

We can plot the numbers easily to see how many are in each group:

plot of chunk unnamed-chunk-4

## pdf 
##   2

I'll write these out to a csv file for further comparison:

write.csv(pchealth, file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/data/Summary_RestingPregnantComparison.csv")

But we also want to see if these are a) different. To do that, we should work with the raw data, not the mean summaries. So we need to grab those. I'll do this by looping over year as before, but instead of grabbing the mean, I'll just grab the whole year's health values:

This yields a data frame where each row has one individual's health for a given year, along with its status. Here's what the first few lines of the data look like:

##      EGNo   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct
## 1118 1118 74.48 72.93 71.76 70.96 70.53 70.34 70.39 70.74 71.36 72.25
## 1127 1127 68.13 68.80 69.70 70.48 71.11 71.66 72.14 72.51 72.82 73.07
## 1135 1135 65.68 65.48 65.30 65.66 66.01 66.41 66.91 67.54 68.35 69.32
## 1157 1157 68.20 68.79 69.59 70.69 71.65 72.48 73.19 73.84 74.47 75.06
## 1142 1142 69.42 70.91 72.63 74.71 76.49 78.05 79.32 80.31 81.13 81.66
## 1145 1145 69.99 70.58 71.42 72.19 72.85 73.45 74.02 74.46 74.85 75.16
##        Nov   Dec year status
## 1118 73.06 73.85 1985     32
## 1127 73.20 73.33 1985     32
## 1135 70.14 70.81 1985     32
## 1157 75.66 76.22 1985     32
## 1142 82.05 82.29 1985     34
## 1145 75.49 75.76 1985     34

In the above, status of 32 means they were resting one year, and pregnant the next. Status of 34 means resting one year, but available the next.

Let's plot out the densities of these to check for normality:

plot of chunk unnamed-chunk-8

## pdf 
##   2

They seem to be left skewed to me. But here's the t.test comparing their means (n.b. I looked in R's help, and these 95% CIs are for the sample estimates as we presumed yesterday) are. You can see that the ones that transition to pregnant are healthier:

t.test(as.vector(pchealth[pchealth$status == 32, 2:13]), as.vector(pchealth[pchealth$status == 
    34, 2:13]))

## 
##  Welch Two Sample t-test
## 
## data:  as.vector(pchealth[pchealth$status == 32, 2:13]) and as.vector(pchealth[pchealth$status == 34, 2:13])
## t = 6.231, df = 3434, p-value = 5.213e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.261 2.418
## sample estimates:
## mean of x mean of y 
##     73.12     71.28

I'll write out the file for you to have the raw data.

write.csv(pchealth, file = "RestingPregnantComparisonRawData.csv")

Available Females: Successful vs. Unsuccessful Comparisons

Now let's look at the available females in the same way.

##   year navail npregNext nhealth phealth
## 1 1985     10         2   56.61   79.23
## 2 1986     13         3   77.94   79.30
## 3 1987     14         6   78.83   79.94
## 4 1988     14         3   76.53   75.41
## 5 1989     14         9   69.55   80.93
## 6 1990      7         2   58.21   78.00

In the avchealth data frame, the variables are as follows:

navail is the sum of 4 to 2, and 4 to 4 animals, i.e. available to pregnant, available to available, respectively.
npregNext is the sum of the 4 to 2 animals.
phealth is the health of the animals who became pregnant
nphealth is the health of the animals who DID NOT became pregnant

I'll write out the file for you:

write.csv(avchealth, file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/data/Summary_AvailablePregnantComparison.csv")

And then we can plot the numbers in each category by year.

plot of chunk unnamed-chunk-13

## pdf 
##   2

And plot the yearly mean scores of health:

plot of chunk unnamed-chunk-14

## pdf 
##   2

But we want to look at the raw data as well so we'll repeat that loop storing the actual health, not the means.

So let's try to improve upon that previous line plot of the summarised data (by year) with some paired boxplots of the raw data. Personally I don't find this effective, but here it is:

dfLong <- melt(avchealth, id.vars = c("year", "status"), measure.vars = 2:13)
p <- ggplot(dfLong, aes(factor(year), value)) + geom_boxplot(aes(fill = factor(status), 
    group = paste(factor(status), year))) + labs(y = "Estimated Health", x = "Year", 
    fill = "Reproductive Status") + scale_fill_grey(start = 1, end = 0.65, labels = c("Av - Preg", 
    "Av - Av")) + theme_bw()
p

plot of chunk unnamed-chunk-16


pdf(file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/images/HealthSummary_AvailAvailPregnantPairedBoxPlots.pdf")
p
dev.off()

## pdf 
##   2

And on that dataset we'll run the t-test:

t.test(as.vector(avchealth[avchealth$status == 42, 2:13]), as.vector(avchealth[avchealth$status == 
    44, 2:13]))

## 
##  Welch Two Sample t-test
## 
## data:  as.vector(avchealth[avchealth$status == 42, 2:13]) and as.vector(avchealth[avchealth$status == 44, 2:13])
## t = 12.92, df = 5538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.868 5.252
## sample estimates:
## mean of x mean of y 
##     74.55     69.99

As before, there are significant differences in the ones who transition to pregnant; they are healthier than those who transition to available.

Here are the density plots of these data:

## pdf 
##   2

These too are left skewed, and I'll write them out to you for further comparison.

write.csv(avchealth, file = "AvailablePregnantHealthComparisonsRawData.csv")

Health of Available Animals in 2007

We saw a dip in health of available animals towards the beginning of 2007. We'd like to see what's the cause of that. First let's grab the list of those animals, and then plot out their individual health.

plot of chunk av2007

So you can see from the plot that there are three animals that are estimated to be dead. These are:

badHealth <- unique(which(df[, 1:12] < 5, arr.ind = T)[, 1])
df[badHealth, "ID"]

## [1] 1248 2420 1013
## 30 Levels: 1012 1013 1140 1142 1145 1158 1175 1204 1233 1240 1241 ... 3010

Health of Available Animals vs. Calving Rates

Here let's plot out a scatter plot of health of those available in the current year vs. the ratio of # of calves next year / # available females this year. The filled circles is the health of those who got pregnant; the empty circles are those who didn't:

plot of chunk unnamed-chunk-21

## pdf 
##   2

Master Plot of Health of Available Animals (Including First Calf)

So the goal here is to just look at the distribution of health in the year two years prior to having a calf. n.b. that we're including the first calf here, which is different from the previous examinations of available.

First thing we need to do is grab the data and mold it into a useful form. The logic here is that we'll take the master data frame summary containing all individual health profiles, find the times that are the two years prior, and then set everything else to NA.

p.idx <- which(calves == 1, arr.ind = TRUE)  # this is the pregnancy year
aidx <- p.idx
aidx[, 2] <- p.idx[, 2] - 24  # this is the pregnancy year minus two years
aidx <- aidx[which(aidx[, 2] > 0), ]

pop.idx <- aidx
pophealth <- sumh/ng
psub <- matrix(NA, dim(pophealth)[1], dim(pophealth)[2])
psub[pop.idx] <- pophealth[pop.idx]
psub[psub < 5] <- NA  # This is to remove health of females that are before firstSight

quantile(psub, c(0.05, 0.25, 0.5, 0.75, 0.95), na.rm = TRUE)

##    5%   25%   50%   75%   95% 
## 54.46 71.27 77.89 82.52 87.42

And then we can visualise the histogram and density plot:

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-23

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## pdf 
##   2

Population Health Scores

Ok, let's look at the population level health scores. First, I need to factor out the young juveniles, and then marry up the juvenile times with the adult times for a consecutive health profile.

##   SightingEGNo   min_date   max_date MinDate MaxDate MinDateInt MaxDateInt
## 1         1005 1993-08-27 1997-08-23  8-1993  8-1997        284        332
## 2         1017 1988-04-03 2011-03-17  4-1988  3-2011        220        495
## 3         1019 1988-05-05 2009-07-18  5-1988  7-2009        221        475
## 4         1021 1988-04-03 1998-08-04  4-1988  8-1998        220        344
## 5         1026 1989-08-28 2009-09-03  8-1989  9-2009        176        477
## 6         1028 1988-09-02 2008-10-12  9-1988 10-2008        225        466

Then let's get the population health data set up using these individual and time indices

And then plotted.

plot of chunk unnamed-chunk-26

We also wanted to try this plot with lines for the variance instead of ribbons. That looks like this:

plot of chunk unnamed-chunk-27

## pdf 
##   2

I think this looks better. I like the ribbon, but if we include the date annotation rectangles, then these are better. I'll continue this style below.

Ok, then what are the summary stats of these data, and what do they look like? Fivenum summary contains: minimum, lower-hinge, median, upper-hinge, maximum.

## [1]  0.00 68.34 78.60 84.09 88.81

plot of chunk unnamed-chunk-28

Note that the 0's are likely from animals that will be estimated as dead. We can talk about pulling out the health values of the animals that the model has estimated to be dead.

Finally, let's add the calving #'s on top of the health plot to round out the picture a little bit.

plot of chunk unnamed-chunk-29

## pdf 
##   2

Sub-population Health

Ok, that was the population level health. What about some of the sub-population categories? We can turn to these now. I'll leave some of the code out since it's just a repeat of what we did to generate the population level values. We need summaries for:

Adult males
Young juvs
Old juvs
Pregnant
Lactating
Resting
Available

plot of chunk adultMales

## pdf 
##   2

plot of chunk youngJuveniles

## pdf 
##   2

plot of chunk oldJuveniles

## pdf 
##   2

And now let's look at the different reproductive Female categories. For the pregnant females, I can use the existing time query, but I'll alter it for lactating, resting, and available per our discussions over email. Here are the pregnant females:

plot of chunk pregFemales

## pdf 
##   2

Now let's adjust the query for the other times in the reproductive life cycle.

plot of chunk lacFemales

## pdf 
##   2

plot of chunk restFemales

## pdf 
##   2

plot of chunk avFemales

## pdf 
##   2

## Warning: row names were found from a short variable and have been
## discarded

Correlations between Health in Available & Resting Females and Calving Ratio

The correlation between health of all available females in one year and calves/year (in the following year) is:

cvec <- table(calfTable$Calving.Year)
cvec <- cvec[which(names(cvec) == 1986):(length(cvec) - 1)]
health <- dfavfem[dfavfem$category == "avfemHealth", -c(1, ncol(dfavfem))]
cvs <- data.frame(calfyr = names(cvec), numcalf = cvec)
hvs <- data.frame(hyear = names(health), hvals = t(health))
df <- cbind(hvs[1:24, ], cvs)
head(df)

##      hyear    X1 calfyr numcalf
## 1984  1984 77.35   1986      13
## 1985  1985 77.48   1987      11
## 1986  1986 78.25   1988       7
## 1987  1987 78.65   1989      16
## 1988  1988 80.19   1990      12
## 1989  1989 78.76   1991      17

cor(df$X1, df$numcalf, use = "complete.obs")

## [1] -0.1004

Note that I've done this quickly, and I'm somewhat tired. So I'll need to go back and check the logic. But it looks right so far. It's weak, but I still like scott's line of thinking about making the link (at least in discussion form) about this and pop consequences.

Health in Different Groups in Good vs. Bad Years

So now we want to examine the median health of animals in different sub groups during 3 time periods:

1992-1994
1997-1999
2004-2006

Let's set these up as indices we can recycle:

idx92 <- which(monYr[, 2] == 1992 | monYr[, 2] == 1993 | monYr[, 2] == 1994)
idx97 <- which(monYr[, 2] == 1997 | monYr[, 2] == 1998 | monYr[, 2] == 1999)
idx04 <- which(monYr[, 2] == 2004 | monYr[, 2] == 2005 | monYr[, 2] == 2006)

First let's look at the population level health, and show it with density plots of the yearly average plot of chunk population

## pdf 
##   2

Then we'll look at Adult Males, juveniles, pregnant, lactating, resting, & available females.

plot of chunk admales

## pdf 
##   2

plot of chunk yJuvs

## pdf 
##   2

plot of chunk oldJuvs

## pdf 
##   2

plot of chunk pregFemales3per

## pdf 
##   2

plot of chunk lacFemales3per

## pdf 
##   2

plot of chunk restFemales3per

## pdf 
##   2

plot of chunk avFemales3per

## pdf 
##   2

Summary Table Data Prep

Ok, we need to make the data to write to excel. What we want is for each row to correspond to a population or sub-population category, and then each column to have the yearly summary of health (with CIs). I'll copy this file, along with the others, to a “data” folder in the main dropbox paper folder.

dfAll <- rbind(dfPop, dfadMales, dfyjuves, dfojuves, dfpregfem, dflacfem, dfrestfem, 
    dfavfem)

## Error: numbers of columns of arguments do not match

write.csv(dfAll, file = "YearlyHealthSummariesByGroup.csv")

## Error: object 'dfAll' not found

Pairwise Comparison within One Individual: EGNo 1245

Ok, this is just a one-off test of doing a comparison of the 'before' and 'after' test for one individual. In this case the befores will be either resting or available years, and the after will be pregnant years. So 1245 has had 4 calves in these years:

calfTable[calfTable$EGNo == 1245, ]

##     EGNo Calving.Year CalfNo CalvingInterval
## 167 1245         1996   2645              NA
## 168 1245         2001   3145               5
## 169 1245         2005   3545               4
## 170 1245         2008   3845               3

idx <- which(ID == 1245)

df <- data.frame(status = newCalves[idx, ], dates = as.Date(paste(monYr[, 1], 
    1, monYr[, 2], sep = "/"), format = "%m/%d/%Y"))
df <- df[which(df$status != 0), ]
p <- ggplot(df, aes(dates, status, group = status, colour = factor(status))) + 
    geom_point() + labs(title = paste("EGNo: ", ID[idx], sep = ""), colour = "status") + 
    scale_colour_discrete(labels = c("Calving", "Pregnant", "Resting", "Available"))
p

plot of chunk unnamed-chunk-32

Ok, let's set up the vectors to get her health in these periods. What I will define as the successful transitions are either resting to pregnant (3 to 2) or available to pregnant (4 to 2). The unsuccessful transitions will be resting to available (3 to 4) or available to available (4 to 4).

Ok, now that we have the indices set up, let's get the health enumerated.

Phew. Now that that is behind us, let's examine health in good versus bad years for EGNo 1245:

t.test(success, failure)

## 
##  Welch Two Sample t-test
## 
## data:  success and failure
## t = 3.728, df = 45.73, p-value = 0.0005294
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.386 24.725
## sample estimates:
## mean of x mean of y 
##     77.11     61.05

Pairwise Comparison within All Individuals

Since the above proved successful, let's expand it to look at all individual females.

femID <- unique(sights[which(sights$Age == "A" & sights$GenderCode == "F"), 
    "SightingEGNo"])
outvals <- numeric(0)

for (j in 1:length(femID)) {
    idx <- femID[j]
    ridx <- which(row.names(newCalves) == idx)

    df <- data.frame(year = monYr[, 2], status = newCalves[ridx, ], change = newCalves[ridx, 
        ] - lead(newCalves[ridx, ]))
    rpidx <- monYr[which(df$status == 3 & df$change == 1), 2]
    apidx <- monYr[which(df$status == 4 & df$change == 2), 2]
    raidx <- monYr[which(df$status == 3 & df$change == -1), 2]
    avavidx <- numeric(0)
    yrseq <- 1980:2009
    for (i in 1:length(yrseq)) {
        yidx <- which(monYr[, 2] == yrseq[i])
        nyidx <- which(monYr[, 2] == yrseq[i + 1])
        cidx <- which(newCalves[ridx, yidx] == 4 & newCalves[ridx, nyidx] == 
            4, arr.ind = TRUE)
        if (length(cidx) == 0) {
            next
        } else {
            avavidx <- c(avavidx, yrseq[i])
        }
    }
    if (length(c(rpidx, apidx)) < 1 | length(c(raidx, avavidx)) < 1) {
        next
    }
    pophealth <- sumh/ng
    pophealth[which(pophealth < 5)] <- NA

    if (length(rpidx) > 0) {
        rphealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] == 
            rpidx)], status = 32)
    }
    if (length(apidx) > 0) {
        aphealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] == 
            apidx)], status = 42)
    }
    if (length(rpidx) > 0 & length(apidx) > 0) {
        success <- c(rphealth$health, aphealth$health)
    }
    if (length(rpidx) > 0 & length(apidx) == 0) {
        success <- c(rphealth$health)
    }
    if (length(rpidx) == 0 & length(apidx) > 0) {
        success <- c(aphealth$health)
    }

    if (length(raidx) > 0) {
        rahealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] == 
            raidx)], status = 34)
    }
    if (length(avavidx) > 0) {
        aahealth <- data.frame(health = pophealth[ridx, which(monYr[, 2] == 
            avavidx)], status = 44)
    }
    if (length(avavidx) > 0 & length(raidx) > 0) {
        failure <- c(rahealth$health, aahealth$health)
    }
    if (length(avavidx) > 0 & length(raidx) == 0) {
        failure <- c(aahealth$health)
    }
    if (length(avavidx) == 0 & length(raidx) > 0) {
        failure <- c(rahealth$health)
    }
    res <- t.test(success, failure)

    dfout <- c(res$statistic, res$parameter, res$p.value, res$estimate, res$conf.int[1], 
        res$conf.int[2], femID[j])
    outvals <- rbind(outvals, dfout)

}

colnames(outvals) <- c("t-statistic", "DegFreedom", "p.value", "X - Mean", "Y - Mean", 
    "CI - Success", "CI - Failure", "EGNo")
write.csv(outvals, file = "pairedHealthComparison.csv")

Health Climatology

Goal here is to plot out the average monthly health over all years to see if we can generate some sort of climatology type picture. Let's prep the data first

pop.idx <- popAll.idx
pophealth <- sumh/ng
for (i in 1:nrow(pop.idx)) {
    pophealth[rownames(pophealth) == pop.idx$SightingEGNo[i], 1:(pop.idx$MinDateInt[i] - 
        1)] <- NA  # before first sight 
    pophealth[rownames(pophealth) == pop.idx$SightingEGNo[i], (pop.idx$MaxDateInt[i] + 
        1):ncol(pophealth)] <- NA  # after last time
}
pophealth[which(pophealth < 5)] <- NA
mhealth <- pophealth[gender == "M", ]
femhealth <- pophealth[gender == "F", ]

mvec <- rep(NA, 12)
femquant <- mquant <- matrix(NA, nrow = 3, ncol = 12)
femvec <- rep(NA, 12)
yrs <- unique(monYr[, 2])
for (i in 1:12) {
    cols <- which(monYr[, 1] == i)
    mvec[i] <- median(mhealth[, cols], na.rm = TRUE)
    femvec[i] <- median(femhealth[, cols], na.rm = TRUE)
    femquant[, i] <- quantile(femhealth[, cols], c(0.05, 0.5, 0.95), na.rm = TRUE)
    mquant[, i] <- quantile(mhealth[, cols], c(0.05, 0.5, 0.95), na.rm = TRUE)
}

And then visualise that

plot(femvec, type = "b", col = "grey60", ylim = c(78, 78.5))
lines(mvec, col = "grey20")
text(6, 78.35, "Males")
text(6, 78.1, "Females")

plot of chunk unnamed-chunk-37


pdf(file = "/Users/rob/Dropbox/Papers/RollandEtAl_RightWhaleHealth/images/HealthClimatology.pdf")
plot(femvec, type = "b", col = "grey60", ylim = c(78, 78.5))
lines(mvec, col = "grey20")
text(6, 78.35, "Males")
text(6, 78.1, "Females")
dev.off()

## pdf 
##   2