Primary Questions of Interest:
What was the overall time to solving the problems (lever pulling, paper ripping, and persistence & repeatability) for chickadees? What covariates/characteristics were most closely predictive of whether a chickadee figured out the problems?
Loading Packages:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(skimr)
library(survival)
library(survminer)
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
##
## The following object is masked from 'package:survival':
##
## myeloma
library(fitdistrplus)
## Loading required package: MASS
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Reading in datasets:
lever = read_csv('LeverPullingPerformance.csv')
## Rows: 2779 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): indiv, site, habitat, age, sex
## dbl (9): latency, status, tstart, tstop, contacts, domscore, urbanscore, exp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
paper = read_csv('PaperRippingPerformance.csv')
## Rows: 3431 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): indiv, site, habitat
## dbl (9): latency, status, tstart, tstop, contacts, domscore, urbanscore, exp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pr = read_csv('Persistence&Repeatability.csv')
## Rows: 70 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): site, indiv, age, sex, habitat, lpsolve, prsolve
## dbl (10): capturedate, urbanscore, explscore, domscore, lplatency, lpstatus,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
DATASET 1: Lever
Variables:
indiv: individual band number identification latency: latency to solve or become censored in the lever-pulling trials status: censoring status of latency to solve the lever-pulling task used for cox regression (0 = censored, 1 = uncensored) tstart: start time (in seconds) of each time interval used for the extended cox regression tstop: stop time (in seconds) of each time interval used for the extended cox regression contacts: number of contacts made with the task in each time interval domscore: dominance score based on interactions between birds at RFID feeders. urbanscore: urbanization score generated by PCA analysis used to infer degree of urbanization at each site. explscore: exploration score generated by PCA analysis. solve: indicator of whether the task was solved in each time interval (0 = unsolved, 1 = solved) site: code of each field site. habitat: rural (urban score < 0) or urban (urban score > 0) habitat type. age: age class – adult or juvenile. sex: female or male.
Preliminary EDA of lever dataset:
skim(lever)
| Name | lever |
| Number of rows | 2779 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| indiv | 0 | 1.00 | 10 | 10 | 0 | 65 | 0 |
| site | 0 | 1.00 | 2 | 3 | 0 | 7 | 0 |
| habitat | 0 | 1.00 | 5 | 5 | 0 | 2 | 0 |
| age | 268 | 0.90 | 5 | 8 | 0 | 2 | 0 |
| sex | 472 | 0.83 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| latency | 0 | 1.00 | 5711.41 | 2240.41 | 11.00 | 3924.00 | 6924.00 | 7577.00 | 7834.00 | ▁▁▃▁▇ |
| status | 0 | 1.00 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| tstart | 0 | 1.00 | 2841.27 | 2103.15 | 0.00 | 1020.00 | 2460.00 | 4320.00 | 7800.00 | ▇▆▅▃▂ |
| tstop | 0 | 1.00 | 2901.27 | 2103.15 | 60.00 | 1080.00 | 2520.00 | 4380.00 | 7860.00 | ▇▆▅▃▂ |
| contacts | 66 | 0.98 | 0.73 | 4.44 | 0.00 | 0.00 | 0.00 | 0.00 | 68.00 | ▇▁▁▁▁ |
| domscore | 1192 | 0.57 | 0.49 | 0.27 | 0.07 | 0.29 | 0.43 | 0.67 | 1.00 | ▆▇▃▅▆ |
| urbanscore | 0 | 1.00 | -0.40 | 1.77 | -2.00 | -2.00 | -1.62 | 1.96 | 2.48 | ▇▁▂▁▃ |
| explscore | 0 | 1.00 | -0.45 | 2.33 | -2.07 | -1.78 | -1.38 | -0.18 | 9.21 | ▇▁▁▁▁ |
| solve | 0 | 1.00 | 0.01 | 0.12 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
summary(lever)
## indiv latency status tstart
## Length:2779 Min. : 11 Min. :0.0000 Min. : 0
## Class :character 1st Qu.:3924 1st Qu.:0.0000 1st Qu.:1020
## Mode :character Median :6924 Median :0.0000 Median :2460
## Mean :5711 Mean :0.2465 Mean :2841
## 3rd Qu.:7577 3rd Qu.:0.0000 3rd Qu.:4320
## Max. :7834 Max. :1.0000 Max. :7800
##
## tstop contacts domscore urbanscore
## Min. : 60 Min. : 0.000 Min. :0.0714 Min. :-1.9999
## 1st Qu.:1080 1st Qu.: 0.000 1st Qu.:0.2857 1st Qu.:-1.9999
## Median :2520 Median : 0.000 Median :0.4286 Median :-1.6219
## Mean :2901 Mean : 0.735 Mean :0.4926 Mean :-0.3982
## 3rd Qu.:4380 3rd Qu.: 0.000 3rd Qu.:0.6667 3rd Qu.: 1.9574
## Max. :7860 Max. :68.000 Max. :1.0000 Max. : 2.4818
## NA's :66 NA's :1192
## explscore solve site habitat
## Min. :-2.0694 Min. :0.00000 Length:2779 Length:2779
## 1st Qu.:-1.7765 1st Qu.:0.00000 Class :character Class :character
## Median :-1.3848 Median :0.00000 Mode :character Mode :character
## Mean :-0.4529 Mean :0.01367
## 3rd Qu.:-0.1824 3rd Qu.:0.00000
## Max. : 9.2119 Max. :1.00000
##
## age sex
## Length:2779 Length:2779
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
head(lever)
## # A tibble: 6 × 14
## indiv latency status tstart tstop conta…¹ domsc…² urban…³ expls…⁴ solve site
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2780-… 3699 0 0 60 0 0.833 1.96 -0.728 0 FRP
## 2 2780-… 3699 0 60 120 19 0.833 1.96 -0.728 0 FRP
## 3 2780-… 3699 0 120 180 21 0.833 1.96 -0.728 0 FRP
## 4 2780-… 3699 0 180 240 4 0.833 1.96 -0.728 0 FRP
## 5 2780-… 3699 0 240 300 2 0.833 1.96 -0.728 0 FRP
## 6 2780-… 3699 0 300 360 0 0.833 1.96 -0.728 0 FRP
## # … with 3 more variables: habitat <chr>, age <chr>, sex <chr>, and abbreviated
## # variable names ¹contacts, ²domscore, ³urbanscore, ⁴explscore
#it looks like each row is a 60 second interval, so for each of 65 birds that are in this dataset there are one or multiple rows, depending on how long it took them to solve the lever problem.
lever %>%
group_by(indiv) %>%
count()
## # A tibble: 65 × 2
## # Groups: indiv [65]
## indiv n
## <chr> <int>
## 1 2730-71136 1
## 2 2730-71137 4
## 3 2780-73204 6
## 4 2780-73248 2
## 5 2780-73253 1
## 6 2780-73254 5
## 7 2780-73255 50
## 8 2780-73256 1
## 9 2780-73288 62
## 10 2780-73289 22
## # … with 55 more rows
lever %>%
filter(solve == 1) %>%
group_by(indiv) %>%
summarize(time = max(tstop), status)
## # A tibble: 38 × 3
## indiv time status
## <chr> <dbl> <dbl>
## 1 2730-71136 60 1
## 2 2730-71137 240 1
## 3 2780-73204 360 1
## 4 2780-73248 120 1
## 5 2780-73253 60 1
## 6 2780-73254 300 1
## 7 2780-73256 60 1
## 8 2780-73289 1320 1
## 9 2780-73299 6780 1
## 10 2780-73300 60 1
## # … with 28 more rows
#38 of the 65 birds in dataset solved the problem
lever %>%
filter(status == 0) %>%
group_by(indiv) %>%
summarize(time = max(tstop))
## # A tibble: 27 × 2
## indiv time
## <chr> <dbl>
## 1 2780-73255 3000
## 2 2780-73288 3720
## 3 2780-73290 3720
## 4 2780-73292 3720
## 5 2780-73293 3960
## 6 2780-73295 3960
## 7 2780-73296 3960
## 8 2780-73298 3960
## 9 2780-73301 120
## 10 2780-73302 7860
## # … with 17 more rows
#27 of 65 birds in dataset did not solve the lever problem
lever %>%
group_by(indiv) %>%
summarize(min_time = min(tstart))
## # A tibble: 65 × 2
## indiv min_time
## <chr> <dbl>
## 1 2730-71136 0
## 2 2730-71137 0
## 3 2780-73204 0
## 4 2780-73248 0
## 5 2780-73253 0
## 6 2780-73254 0
## 7 2780-73255 0
## 8 2780-73256 0
## 9 2780-73288 0
## 10 2780-73289 0
## # … with 55 more rows
#all the birds start at 0
#this summary table will be used to for our survival analysis so there is 1 row per bird with the time-to-event (pulling the lever) as variable "time"
lever2 = lever %>%
group_by(status, indiv, habitat,age) %>%
summarize (time = max(tstop),
contacts = sum(contacts),
domscore = unique(domscore),
urbanscore = unique(urbanscore),
explscore = unique(explscore),
solve = max(solve))
## `summarise()` has grouped output by 'status', 'indiv', 'habitat'. You can
## override using the `.groups` argument.
#1st km curve for all birds in trial (no grouping)
lever2.km = survfit(Surv(lever2$time,lever2$status) ~ 1)
#km table
summary(lever2.km)
## Call: survfit(formula = Surv(lever2$time, lever2$status) ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 60 65 9 0.862 0.0428 0.782 0.950
## 120 55 5 0.783 0.0513 0.689 0.891
## 180 49 2 0.751 0.0540 0.653 0.865
## 240 47 4 0.687 0.0581 0.582 0.811
## 300 42 4 0.622 0.0611 0.513 0.754
## 360 38 1 0.605 0.0616 0.496 0.739
## 420 37 1 0.589 0.0621 0.479 0.724
## 540 35 1 0.572 0.0626 0.462 0.709
## 900 34 1 0.555 0.0629 0.445 0.694
## 1020 33 2 0.522 0.0635 0.411 0.662
## 1080 31 1 0.505 0.0636 0.394 0.646
## 1320 30 1 0.488 0.0637 0.378 0.630
## 3660 25 1 0.469 0.0641 0.358 0.613
## 3960 21 1 0.446 0.0648 0.336 0.593
## 4020 16 1 0.418 0.0665 0.306 0.571
## 5400 14 1 0.389 0.0681 0.276 0.548
## 6780 13 1 0.359 0.0691 0.246 0.523
## 6960 12 1 0.329 0.0695 0.217 0.498
print(lever2.km)
## Call: survfit(formula = Surv(lever2$time, lever2$status) ~ 1)
##
## n events median 0.95LCL 0.95UCL
## [1,] 65 38 1320 360 6960
#transforming y to be 1 -S(t) for better interpretability given type of event (solving a problem)
ggsurvplot(fit = lever2.km, data = lever2, risk.table = F, fun = "event", conf.int = T, surv.median.line = "hv") +
labs(title = "Survival Analysis of Chickadees Solving Lever Problem",
y = "Probability of Pulling Lever",
x = "Time (seconds)")
Binning bird characteristic variables for survival analysis
#binning domscore:
lever2 = lever2 %>% mutate(dom_bin = case_when(
domscore <= .50 ~ "low",
domscore > .50 ~ "high"))
#distribution of explscore:
ggplot(data = lever2, aes(x = explscore)) + geom_histogram(binwidth = 3)
#there is quite the outlier! That's one intrepid bird.
#binning explscore:
lever2 = lever2 %>% mutate(expl_bin = case_when(
explscore <= -.89 ~ "low",
explscore > -.89 ~ "high"))
Age
#creating new km table, grouping by type of habitat and age of bird:
lever2.km2 = survfit(Surv(lever2$time,lever2$status) ~ lever2$age)
ggsurvplot(fit = lever2.km2, data = lever2, risk.table = F, conf.int = F, fun = "event", linetype = "strata", legend = "right", palette="Paired") +
labs(title = "Adults are better at Lever Pulling Task",
y = "Probability of Pulling Lever",
x = "Time (seconds)")
#log rank test
survdiff(Surv(lever2$time,lever2$status) ~ lever2$age)
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$age)
##
## n=58, 7 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$age=adult 33 23 17.2 1.97 4.35
## lever2$age=juvenile 25 11 16.8 2.01 4.35
##
## Chisq= 4.3 on 1 degrees of freedom, p= 0.04
# p = .04
Dominance Score affects bird’s task performance
#km table, grouping by how dominant and exploratory the birds are:
lever2.km3 = survfit(Surv(lever2$time,lever2$status) ~ strata(lever2$dom_bin))
ggsurvplot(fit = lever2.km3, data = lever2, risk.table = F, conf.int = F, fun = "event", linetype = "strata", legend = "right", palette="Paired") +
labs(title = "The More Dominant are Better at Pulling Lever",
y = "Probability of Pulling Lever",
x = "Time (seconds)")
#log rank test
survdiff(Surv(lever2$time,lever2$status) ~ lever2$dom_bin) #p = .09
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$dom_bin)
##
## n=39, 26 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$dom_bin=high 22 15 11 1.41 2.94
## lever2$dom_bin=low 17 8 12 1.31 2.94
##
## Chisq= 2.9 on 1 degrees of freedom, p= 0.09
survdiff(Surv(lever2$time,lever2$status) ~ lever2$domscore) #p = .05
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$domscore)
##
## n=39, 26 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$domscore=0.071428571 1 0 0.670 6.70e-01 7.29e-01
## lever2$domscore=0.125 1 0 0.670 6.70e-01 7.29e-01
## lever2$domscore=0.142857143 2 2 1.290 3.90e-01 4.38e-01
## lever2$domscore=0.166666667 1 0 0.742 7.42e-01 8.06e-01
## lever2$domscore=0.25 2 1 1.246 4.85e-02 5.42e-02
## lever2$domscore=0.285714286 1 0 1.162 1.16e+00 1.30e+00
## lever2$domscore=0.333333333 2 1 1.027 7.36e-04 8.15e-04
## lever2$domscore=0.357142857 1 0 0.670 6.70e-01 7.29e-01
## lever2$domscore=0.375 1 0 1.162 1.16e+00 1.30e+00
## lever2$domscore=0.428571429 1 1 0.996 2.02e-05 2.23e-05
## lever2$domscore=0.444444444 1 0 1.162 1.16e+00 1.30e+00
## lever2$domscore=0.5 3 3 1.156 2.94e+00 3.31e+00
## lever2$domscore=0.571428571 2 1 0.374 1.05e+00 1.21e+00
## lever2$domscore=0.625 1 1 0.617 2.37e-01 2.58e-01
## lever2$domscore=0.642857143 1 1 0.567 3.30e-01 3.61e-01
## lever2$domscore=0.666666667 2 0 1.904 1.90e+00 2.17e+00
## lever2$domscore=0.714285714 1 1 0.128 5.93e+00 6.80e+00
## lever2$domscore=0.75 2 2 0.787 1.87e+00 2.09e+00
## lever2$domscore=0.785714286 1 1 0.433 7.42e-01 8.14e-01
## lever2$domscore=0.833333333 1 0 0.742 7.42e-01 8.06e-01
## lever2$domscore=0.857142857 1 0 1.162 1.16e+00 1.30e+00
## lever2$domscore=0.875 1 1 0.187 3.53e+00 3.95e+00
## lever2$domscore=0.928571429 2 2 0.981 1.06e+00 1.17e+00
## lever2$domscore=1 7 5 3.164 1.06e+00 1.37e+00
##
## Chisq= 35.5 on 23 degrees of freedom, p= 0.05
#This large p values are probably due to the very small sample size!
Initial Findings:
From the survival curves above, the data suggests adult, rural birds with a high dominance and exploration scores are the most likely to complete the lever pulling task. It appears that age and dominance play a large role in determining likelihood in solving the problem.
#preparing lever2 dataset for fitdistcens:
lever3 = lever2 %>% mutate(left= ifelse(time == 0,time+.001,time),
right= ifelse(status == 1,left, NA))
#weibull
weibullFit.lever3=fitdistcens(data.frame(lever3[,c("left","right")]), dist = "weibull")
summary(weibullFit.lever3)$estimate
## shape scale
## 0.4676573 5041.4391445
plot(weibullFit.lever3)
#make weibull act expo using fix.arg shape = 1 ?
#exponential
#expFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "exp") #exponential does not work #summary(expFit.lever3)
#plot(expFit.lever3)
#lognnormal
logFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "lnorm")
summary(logFit.lever3)$estimate
## meanlog sdlog
## 7.536097 2.600340
plot(logFit.lever3)
#assessing goodness of fits
weibullFit.lever3$aic
## [1] 674.9868
logFit.lever3$aic
## [1] 665.9616
#gammaFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "gamma")
#gamma doesn't work either!
#our lognormal distribution is the best fit
cdfcompcens(list(logFit.lever3,weibullFit.lever3),
legendtext=c("Lognormal","Weibull"),
xlab="Time to Pull Lever (seconds)",
plotstyle = "ggplot")
qqcompcens(list(logFit.lever3,weibullFit.lever3),
legendtext=c("Lognormal","Weibull")
,plotstyle = "ggplot")
Fitting parametric lognormal curve
meanlog.lever = summary(logFit.lever3)$estimate[1]
sdlog.lever = summary(logFit.lever3)$estimate[2]
lnorm_values = rlnorm(1000, meanlog = meanlog.lever, sdlog = sdlog.lever)
modelfit.lever = tibble(time = pmin(8000,lnorm_values)) %>%
mutate(status = ifelse(time >= 8000, 0,1), dataType = "model")
observed.lever = lever2 %>% ungroup %>% dplyr::select(time, status) %>% mutate(dataType = "observed")
dist.lever = bind_rows(modelfit.lever,observed.lever)
km.lever = survfit(Surv(dist.lever$time,dist.lever$status) ~ strata(dist.lever$dataType))
ggsurvplot(fit = km.lever, data = dist.lever, risk.table = F, conf.int = T
, fun = "event"
, linetype = "strata"
, palette="Dark2"
, legend = "right"
, surv.median.line = 'hv') +
labs(title = "Lognormal Fit of Chickadee Lever Pulling Probability")
#What percent of birds pulled the lever within 1800 second (a half hour?)
plnorm(1800, meanlog = meanlog.lever, sdlog = sdlog.lever)
## [1] 0.4937783
#.49
#at what time would half have the birds have pulled the lever?
qlnorm(0.5, meanlog = meanlog.lever, sdlog = sdlog.lever, lower.tail = TRUE )
## [1] 1874.5
#1874.5
Weibull Regression
Weibull regression modeling 100 birds each for 2 different levels of dominance
survObject=Surv(lever3$left,lever3$right,type="interval2")
weibullMod = survreg(survObject~lever3$domscore, dist = "weibull")
summary(weibullMod)
##
## Call:
## survreg(formula = survObject ~ lever3$domscore, dist = "weibull")
## Value Std. Error z p
## (Intercept) 10.267 0.999 10.28 < 2e-16
## lever3$domscore -3.085 1.360 -2.27 0.02331
## Log(scale) 0.584 0.175 3.33 0.00087
##
## Scale= 1.79
##
## Weibull distribution
## Loglik(model)= -202.7 Loglik(intercept only)= -205.4
## Chisq= 5.43 on 1 degrees of freedom, p= 0.02
## Number of Newton-Raphson Iterations: 5
## n=39 (26 observations deleted due to missingness)
b0 = 10.267
b1 = -3.085
#make random dominance scores for birds:
random_numbers <- function(n, min, max) {
random_values <- runif(n, min = min, max = max)
return(random_values)
}
lows = random_numbers(1000,.01,.5)
highs = random_numbers(1000,.51,.99)
sigma = summary(weibullMod)$scale
#choose domscores of .25 and .75 to simulate
low_dom_birds = rweibull(1000, shape = 1/sigma, scale=exp(b0 + b1*lows))
#dom50_rural_adult = rweibull(100, shape = 1/sigma, scale=exp(b0 + b1*.50))
high_dom_birds = rweibull(1000, shape = 1/sigma, scale=exp(b0 + b1*highs))
modelfit = tibble(domscore = c(lows,highs),
time = pmin(8000,c(low_dom_birds,high_dom_birds))) %>%
mutate(status = ifelse(time == 8000, 0,1), dataType = "model") %>%
mutate(dom_bin = case_when(
domscore <= .50 ~ "low",
domscore > .50 ~ "high"))
lever4 = lever3 %>% dplyr::select(domscore,dom_bin,time, status)
## Adding missing grouping variables: `indiv`, `habitat`
lever4 = lever4 %>% ungroup %>% dplyr::select(-indiv, -habitat) %>%
mutate(dataType = "observed")
together = bind_rows(lever4, modelfit)
km7 = survfit(Surv(together$time, together$status) ~ strata(together$dom_bin) + strata(together$dataType))
ggsurvplot(fit = km7, data = together, fun = "event", conf.int = F, legend = "right", palette="Paired")+
labs(title = "Weibull Regression of Chickadee Lever Pulling Probability")
Diagnostics
#diagnostics
devresids=residuals(weibullMod,type="deviance")
#lever3 %>%
#mutate(devresids=devresids) %>%
#ggplot(aes(y=devresids,x=age,fill=age)) +
# geom_boxplot() +
#geom_hline(yintercept = 0,linetype="dotted") +
#labs(title="Deviance Residuals for logNormal Model", x="age",y="Deviance Residuals") +
#coord_flip() +
#theme_minimal()
#plotting residuals in throwing error due to presence of NAs, I believe.
#these models may not be most appropriate for this dataset
Cox proportional hazards model: Explanation of use of cox proportional hazards regression from study:
“When analysing the predictors of problem-solving performance, we conducted an extended cox proportional hazards regression. This is a semi-parametric survival analysis approach that makes no assumptions about the distribution of the response variable [49]. We used this approach as an alternative to assigning capped latencies to individuals that did not solve a problem, which would suggest that the bird solved at that time and may not allow the data to meet assumptions of normality for commonly used regression analyses. Using censored observations allows us to avoid those assumptions and incorporate our limited knowledge of eachbird’s performance on a task (e.g. we only know that an unsuccessful bird was not able to solvea task in the time given)”
In other words, given the quantity of birds that did not complete the task and are therefore censored, using a semi-parametric regression model
cox_mod = coxph(Surv(time, status) ~ domscore + explscore + habitat + age, data = lever3)
summary(cox_mod)
## Call:
## coxph(formula = Surv(time, status) ~ domscore + explscore + habitat +
## age, data = lever3)
##
## n= 35, number of events= 20
## (30 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## domscore 1.8330 6.2527 0.9292 1.973 0.0485 *
## explscore 0.1845 1.2027 0.1018 1.813 0.0698 .
## habitaturban -0.8638 0.4216 0.5757 -1.500 0.1335
## agejuvenile -0.3058 0.7365 0.5657 -0.541 0.5888
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## domscore 6.2527 0.1599 1.0118 38.639
## explscore 1.2027 0.8315 0.9852 1.468
## habitaturban 0.4216 2.3722 0.1364 1.303
## agejuvenile 0.7365 1.3577 0.2430 2.232
##
## Concordance= 0.706 (se = 0.077 )
## Likelihood ratio test= 10.56 on 4 df, p=0.03
## Wald test = 10.38 on 4 df, p=0.03
## Score (logrank) test = 11.11 on 4 df, p=0.03
ggsurvplot(survfit(cox_mod), fun = "event", data = lever3, surv.median.line = "hv", palette="Paired")+
labs(title = "Cox Mod of Chickadee Lever Pulling Probability")
#this plot makes sense in comparing the median value of
median(lever3$time)
## [1] 1020
DATASET 2: Paper
indiv: individual band number identification. latency: latency to solve or become censored in the paper-ripping trials. status: censoring status of latency to solve the paper-ripping task used for cox regression (0 = censored, 1 = uncensored). tstart: start time (in seconds) of each time interval used for the extended cox regression. tstop: stop time (in seconds) of each time interval used for the extended cox regression. contacts: number of contacts made with the task in each time interval. domscore: dominance score based on interactions between birds at RFID feeders. urbanscore: urbanization score generated by PCA analysis used to infer degree of urbanization at each site. explscore: exploration score generated by PCA analysis. solve: indicator of whether the task was solved in each time interval (0 = unsolved, 1 = solved). site: code of each field site. habitat: rural (urban score < 0) or urban (urban score > 0) habitat type.
Getting acquainted with the Paper trial dataset
paper
## # A tibble: 3,431 × 12
## indiv latency status tstart tstop conta…¹ domsc…² urban…³ expls…⁴ solve site
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2780… 45 1 0 60 17 0.833 1.96 -0.728 1 FRP
## 2 2780… 3751 0 0 60 0 0.5 1.96 -1.58 0 FRP
## 3 2780… 3751 0 60 120 3 0.5 1.96 -1.58 0 FRP
## 4 2780… 3751 0 120 180 0 0.5 1.96 -1.58 0 FRP
## 5 2780… 3751 0 180 240 3 0.5 1.96 -1.58 0 FRP
## 6 2780… 3751 0 240 300 2 0.5 1.96 -1.58 0 FRP
## 7 2780… 3751 0 300 360 3 0.5 1.96 -1.58 0 FRP
## 8 2780… 3751 0 360 420 0 0.5 1.96 -1.58 0 FRP
## 9 2780… 3751 0 420 480 0 0.5 1.96 -1.58 0 FRP
## 10 2780… 3751 0 480 540 0 0.5 1.96 -1.58 0 FRP
## # … with 3,421 more rows, 1 more variable: habitat <chr>, and abbreviated
## # variable names ¹contacts, ²domscore, ³urbanscore, ⁴explscore
skim(paper)
| Name | paper |
| Number of rows | 3431 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| indiv | 0 | 1 | 10 | 10 | 0 | 60 | 0 |
| site | 0 | 1 | 2 | 3 | 0 | 7 | 0 |
| habitat | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| latency | 0 | 1.00 | 5702.94 | 2064.27 | 18.00 | 3751.00 | 7079.00 | 7373.00 | 8098.00 | ▁▁▅▂▇ |
| status | 0 | 1.00 | 0.19 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| tstart | 0 | 1.00 | 2834.16 | 2036.76 | 0.00 | 1140.00 | 2520.00 | 4260.00 | 8040.00 | ▇▇▅▃▂ |
| tstop | 0 | 1.00 | 2894.16 | 2036.76 | 60.00 | 1200.00 | 2580.00 | 4320.00 | 8100.00 | ▇▇▅▃▂ |
| contacts | 0 | 1.00 | 0.26 | 1.81 | 0.00 | 0.00 | 0.00 | 0.00 | 50.00 | ▇▁▁▁▁ |
| domscore | 1676 | 0.51 | 0.57 | 0.30 | 0.07 | 0.33 | 0.50 | 0.88 | 1.00 | ▅▅▅▃▇ |
| urbanscore | 0 | 1.00 | -0.47 | 1.76 | -2.00 | -1.89 | -1.62 | 1.96 | 2.48 | ▇▁▂▁▃ |
| explscore | 0 | 1.00 | 0.07 | 2.53 | -2.07 | -1.70 | -1.11 | 0.84 | 9.21 | ▇▂▁▁▁ |
| solve | 0 | 1.00 | 0.01 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
paper2 = paper %>% group_by(status, indiv, habitat) %>% summarize (time = max(tstop), contacts = sum(contacts), domscore = unique(domscore), urbanscore = unique(urbanscore), explscore = unique(explscore), solve = max(solve))
## `summarise()` has grouped output by 'status', 'indiv'. You can override using
## the `.groups` argument.
#general km curve all birds (no strata):
paper2.km = survfit(Surv(paper2$time, paper2$status) ~1)
ggsurvplot(fit = paper2.km, data = paper2, risk.table = F, conf.int = T, fun = "event", surv.median.line = "hv", palette = 'Paired') +
labs(title = "KM Curve of Chickadee Paper Shredding Probability")
Parametric fits
paper3 = paper2 %>% mutate(left= ifelse(time == 0,time+.001,time),
right= ifelse(status == 1,left, NA))
#weibull
weibullFit.paper3=fitdistcens(data.frame(paper3[,c("left","right")]), dist = "weibull")
summary(weibullFit.paper3)$estimate
## shape scale
## 4.896717e-01 1.151936e+04
plot(weibullFit.paper3)
#exponential - does not work! Should coerce weibull to act like exp with fixed arg shape = 1
#expFit.paper3 = fitdistcens(data.frame(paper3[,c("left","right")]), dist = "exp")
#summary(expFit.paper3)
#plot(expFit.paper3)
#lognnormal
logFit.paper3 = fitdistcens(data.frame(paper3[,c("left","right")]), dist = "lnorm")
summary(logFit.paper3)$estimate
## meanlog sdlog
## 8.545151 2.782355
plot(logFit.paper3)
weibullFit.paper3$aic
## [1] 551.7791
logFit.paper3$aic
## [1] 546.7332
#log wins again!
Plotting both non-parametric and parametric curves
#getting my values from lnorm fit:
meanlog = summary(logFit.paper3)$estimate[1]
sdlog = summary(logFit.paper3)$estimate[2]
#generating values from lnorm distribution
lnorm_values = rlnorm(1000, meanlog = meanlog, sdlog = sdlog)
modelfit.paper = tibble(time = pmin(8000,lnorm_values)) %>%
mutate(status = ifelse(time == 8000, 0,1), dataType = "model")
observed.paper = paper2 %>% ungroup %>% dplyr::select(time, status) %>% mutate(dataType = "observed")
dist.paper = bind_rows(modelfit.paper,observed.paper)
km.paper = survfit(Surv(dist.paper$time,dist.paper$status) ~ strata(dist.paper$dataType))
ggsurvplot(fit = km.paper, data = dist.paper, risk.table = F, conf.int = T
, fun = "event"
, palette = 'Dark2'
, linetype = "strata"
, legend = "right"
, surv.median.line = 'hv') +
labs(title = "Lognormal Fit of Chickadee Paper Ripping Probability", y = "Probability of Ripping Paper")
qlnorm(.50,meanlog = meanlog, sdlog = sdlog)
## [1] 5141.76
Dominance characteristic did not have a huge affect on ability to solve problem:
paper2 = paper2 %>% mutate(dom_bin = case_when(domscore <= .5 ~ "low",
domscore > .5 ~ "high"))
paper2.km2 = survfit(Surv(paper2$time,paper2$status) ~ paper2$dom_bin)
ggsurvplot(fit = paper2.km2, data = paper2, risk.table = F, conf.int = F, fun = "event", palette = 'Paired') +
labs(title = "Does more dominant mean better at paper test?", y = "Probability of Ripping Paper")
#log rank test
survdiff(Surv(paper2$time,paper2$status) ~ paper2$dom_bin)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$dom_bin)
##
## n=36, 24 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$dom_bin=high 20 14 9.42 2.23 4.42
## paper2$dom_bin=low 16 6 10.58 1.98 4.42
##
## Chisq= 4.4 on 1 degrees of freedom, p= 0.04
paper2 %>% group_by(dom_bin, status) %>% count()
## # A tibble: 6 × 3
## # Groups: dom_bin, status [6]
## dom_bin status n
## <chr> <dbl> <int>
## 1 high 0 6
## 2 high 1 14
## 3 low 0 10
## 4 low 1 6
## 5 <NA> 0 15
## 6 <NA> 1 9
#there are not enough observations in each group to create parametric curve on how dominance affects ability to rip paper to get at tasty worms
looking at expl_score
paper2 %>% ggplot() + geom_histogram(aes(x = explscore))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
med_explscore = median(paper2$explscore)
#binning expl_score
paper2 = paper2 %>% mutate(expl_bin = case_when(
explscore <= med_explscore ~ "low",
explscore > med_explscore ~ "high")) %>%
left_join(pr)
## Joining, by = c("indiv", "habitat", "domscore", "urbanscore", "explscore")
paper2 %>% group_by(expl_bin, status) %>% count()
## # A tibble: 4 × 3
## # Groups: expl_bin, status [4]
## expl_bin status n
## <chr> <dbl> <int>
## 1 high 0 14
## 2 high 1 16
## 3 low 0 17
## 4 low 1 13
#there are enough non-censored observation
#examining km curves stratified by expl_bin:
paper2.km3 = survfit(Surv(paper2$time,paper2$status) ~ strata(paper2$expl_bin))
ggsurvplot(fit = paper2.km3, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') +
labs(title = "Paper ripping by explorer score", y = "Probability of Ripping Paper")
#pretty similar km curves....
survdiff(Surv(time,status) ~ expl_bin, data = paper2)
## Call:
## survdiff(formula = Surv(time, status) ~ expl_bin, data = paper2)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## expl_bin=high 30 16 13.4 0.495 0.946
## expl_bin=low 30 13 15.6 0.426 0.946
##
## Chisq= 0.9 on 1 degrees of freedom, p= 0.3
# not statistically significant difference between high and low exploration scores
How does habitat affect task completion?
#urban or rural:
paper2.km3 = survfit(Surv(paper2$time,paper2$status) ~ strata(paper2$habitat))
ggsurvplot(fit = paper2.km3, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') +
labs(title = "Paper Shredding: Rural vs. Urban Birds")
#also very similar curves
survdiff(Surv(paper2$time,paper2$status) ~ paper2$habitat)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$habitat)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$habitat=rural 30 18 16.6 0.112 0.295
## paper2$habitat=urban 30 11 12.4 0.151 0.295
##
## Chisq= 0.3 on 1 degrees of freedom, p= 0.6
How about age?
paper2.km4 = survfit(Surv(paper2$time,paper2$status) ~ strata(paper2$age))
ggsurvplot(fit = paper2.km4, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') +
labs(title = "Paper Shredding: Adult vs. Juvenile Birds")
#also very similar curves
survdiff(Surv(paper2$time,paper2$status) ~ paper2$age)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$age)
##
## n=56, 4 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$age=adult 31 15 15 0.000110 0.000266
## paper2$age=juvenile 25 11 11 0.000149 0.000266
##
## Chisq= 0 on 1 degrees of freedom, p= 1