Introduction

This is a demonstration of using survival analysis techniques to investigate how factors such as age, dominance ranking, and environment affect a chickadee’s ability to complete a task. I have used the data to create lognormal distribution fits and a weibull regression model for each trial.

Study & Data Source:

Prasher, Sanjay; Evans, Julian C.; Thompson, Megan J.; Morand-Ferron, Julie (2019), Data from: Characterizing innovators: ecological and individual predictors of problem-solving performance, Dryad, Dataset, https://doi.org/10.5061/dryad.s83d4n1

Abstract from Study:

“Behavioural innovation, the use of new behaviours or existing ones in novel contexts, can have important ecological and evolutionary consequences for animals. An understanding of these consequences would be incomplete without considering the traits that predispose certain individuals to exhibit innovative behaviour. Several individual and ecological variables are hypothesized to affect innovativeness, but empirical studies show mixed results. We examined the effects of dominance rank, exploratory personality, and urbanisation on the innovativeness of wild-caught black-capped chickadees using a survival analysis of their performance in two problem-solving tasks. Additionally, we provide one of the first investigations of the predictors of persistence in a problem-solving context. For lever pulling, we found a trend for dominants to outperform subordinates, particularly in rural birds, which did not align with predictions from the necessity drives innovation hypothesis. When examining possible explanations for this trend we found that older chickadees outperformed younger birds. This follow-up analysis also revealed a positive effect of exploratory personality on the lever-pulling performance of chickadees. Our results suggest that experience may foster innovation in certain circumstances, for instance via the application of previously-acquired information or skills to a novel problem. As we found different predictors for both tasks, this suggests that task characteristics influence the innovative propensity of individuals, and that their effects should be investigated experimentally.”

Primary Questions of Interest:

What was the overall time to solving the problems (lever pulling, paper ripping, and persistence & repeatability) for chickadees? What covariates/characteristics were most closely predictive of whether a chickadee figured out the problems?

Loading Packages:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(skimr)
library(survival)
library(survminer)
## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## 
## The following object is masked from 'package:survival':
## 
##     myeloma
library(fitdistrplus)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Reading in datasets:

lever = read_csv('LeverPullingPerformance.csv')
## Rows: 2779 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): indiv, site, habitat, age, sex
## dbl (9): latency, status, tstart, tstop, contacts, domscore, urbanscore, exp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
paper = read_csv('PaperRippingPerformance.csv')
## Rows: 3431 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): indiv, site, habitat
## dbl (9): latency, status, tstart, tstop, contacts, domscore, urbanscore, exp...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pr = read_csv('Persistence&Repeatability.csv')
## Rows: 70 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): site, indiv, age, sex, habitat, lpsolve, prsolve
## dbl (10): capturedate, urbanscore, explscore, domscore, lplatency, lpstatus,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

DATASET 1: Lever

Variables:

indiv: individual band number identification latency: latency to solve or become censored in the lever-pulling trials status: censoring status of latency to solve the lever-pulling task used for cox regression (0 = censored, 1 = uncensored) tstart: start time (in seconds) of each time interval used for the extended cox regression tstop: stop time (in seconds) of each time interval used for the extended cox regression contacts: number of contacts made with the task in each time interval domscore: dominance score based on interactions between birds at RFID feeders. urbanscore: urbanization score generated by PCA analysis used to infer degree of urbanization at each site. explscore: exploration score generated by PCA analysis. solve: indicator of whether the task was solved in each time interval (0 = unsolved, 1 = solved) site: code of each field site. habitat: rural (urban score < 0) or urban (urban score > 0) habitat type. age: age class – adult or juvenile. sex: female or male.

Preliminary EDA of lever dataset:

skim(lever)
Data summary
Name lever
Number of rows 2779
Number of columns 14
_______________________
Column type frequency:
character 5
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
indiv 0 1.00 10 10 0 65 0
site 0 1.00 2 3 0 7 0
habitat 0 1.00 5 5 0 2 0
age 268 0.90 5 8 0 2 0
sex 472 0.83 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latency 0 1.00 5711.41 2240.41 11.00 3924.00 6924.00 7577.00 7834.00 ▁▁▃▁▇
status 0 1.00 0.25 0.43 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tstart 0 1.00 2841.27 2103.15 0.00 1020.00 2460.00 4320.00 7800.00 ▇▆▅▃▂
tstop 0 1.00 2901.27 2103.15 60.00 1080.00 2520.00 4380.00 7860.00 ▇▆▅▃▂
contacts 66 0.98 0.73 4.44 0.00 0.00 0.00 0.00 68.00 ▇▁▁▁▁
domscore 1192 0.57 0.49 0.27 0.07 0.29 0.43 0.67 1.00 ▆▇▃▅▆
urbanscore 0 1.00 -0.40 1.77 -2.00 -2.00 -1.62 1.96 2.48 ▇▁▂▁▃
explscore 0 1.00 -0.45 2.33 -2.07 -1.78 -1.38 -0.18 9.21 ▇▁▁▁▁
solve 0 1.00 0.01 0.12 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
summary(lever)
##     indiv              latency         status           tstart    
##  Length:2779        Min.   :  11   Min.   :0.0000   Min.   :   0  
##  Class :character   1st Qu.:3924   1st Qu.:0.0000   1st Qu.:1020  
##  Mode  :character   Median :6924   Median :0.0000   Median :2460  
##                     Mean   :5711   Mean   :0.2465   Mean   :2841  
##                     3rd Qu.:7577   3rd Qu.:0.0000   3rd Qu.:4320  
##                     Max.   :7834   Max.   :1.0000   Max.   :7800  
##                                                                   
##      tstop         contacts         domscore        urbanscore     
##  Min.   :  60   Min.   : 0.000   Min.   :0.0714   Min.   :-1.9999  
##  1st Qu.:1080   1st Qu.: 0.000   1st Qu.:0.2857   1st Qu.:-1.9999  
##  Median :2520   Median : 0.000   Median :0.4286   Median :-1.6219  
##  Mean   :2901   Mean   : 0.735   Mean   :0.4926   Mean   :-0.3982  
##  3rd Qu.:4380   3rd Qu.: 0.000   3rd Qu.:0.6667   3rd Qu.: 1.9574  
##  Max.   :7860   Max.   :68.000   Max.   :1.0000   Max.   : 2.4818  
##                 NA's   :66       NA's   :1192                      
##    explscore           solve             site             habitat         
##  Min.   :-2.0694   Min.   :0.00000   Length:2779        Length:2779       
##  1st Qu.:-1.7765   1st Qu.:0.00000   Class :character   Class :character  
##  Median :-1.3848   Median :0.00000   Mode  :character   Mode  :character  
##  Mean   :-0.4529   Mean   :0.01367                                        
##  3rd Qu.:-0.1824   3rd Qu.:0.00000                                        
##  Max.   : 9.2119   Max.   :1.00000                                        
##                                                                           
##      age                sex           
##  Length:2779        Length:2779       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
head(lever)
## # A tibble: 6 × 14
##   indiv  latency status tstart tstop conta…¹ domsc…² urban…³ expls…⁴ solve site 
##   <chr>    <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>
## 1 2780-…    3699      0      0    60       0   0.833    1.96  -0.728     0 FRP  
## 2 2780-…    3699      0     60   120      19   0.833    1.96  -0.728     0 FRP  
## 3 2780-…    3699      0    120   180      21   0.833    1.96  -0.728     0 FRP  
## 4 2780-…    3699      0    180   240       4   0.833    1.96  -0.728     0 FRP  
## 5 2780-…    3699      0    240   300       2   0.833    1.96  -0.728     0 FRP  
## 6 2780-…    3699      0    300   360       0   0.833    1.96  -0.728     0 FRP  
## # … with 3 more variables: habitat <chr>, age <chr>, sex <chr>, and abbreviated
## #   variable names ¹​contacts, ²​domscore, ³​urbanscore, ⁴​explscore
#it looks like each row is a 60 second interval, so for each of 65 birds that are in this dataset there are one or multiple rows, depending on how long it took them to solve the lever problem.

lever %>% 
  group_by(indiv) %>% 
    count()
## # A tibble: 65 × 2
## # Groups:   indiv [65]
##    indiv          n
##    <chr>      <int>
##  1 2730-71136     1
##  2 2730-71137     4
##  3 2780-73204     6
##  4 2780-73248     2
##  5 2780-73253     1
##  6 2780-73254     5
##  7 2780-73255    50
##  8 2780-73256     1
##  9 2780-73288    62
## 10 2780-73289    22
## # … with 55 more rows
lever %>% 
  filter(solve == 1) %>% 
    group_by(indiv) %>% 
      summarize(time = max(tstop), status)
## # A tibble: 38 × 3
##    indiv       time status
##    <chr>      <dbl>  <dbl>
##  1 2730-71136    60      1
##  2 2730-71137   240      1
##  3 2780-73204   360      1
##  4 2780-73248   120      1
##  5 2780-73253    60      1
##  6 2780-73254   300      1
##  7 2780-73256    60      1
##  8 2780-73289  1320      1
##  9 2780-73299  6780      1
## 10 2780-73300    60      1
## # … with 28 more rows
#38 of the 65 birds in dataset solved the problem 

lever %>% 
  filter(status == 0) %>%  
    group_by(indiv) %>% 
      summarize(time = max(tstop)) 
## # A tibble: 27 × 2
##    indiv       time
##    <chr>      <dbl>
##  1 2780-73255  3000
##  2 2780-73288  3720
##  3 2780-73290  3720
##  4 2780-73292  3720
##  5 2780-73293  3960
##  6 2780-73295  3960
##  7 2780-73296  3960
##  8 2780-73298  3960
##  9 2780-73301   120
## 10 2780-73302  7860
## # … with 17 more rows
#27 of 65 birds in dataset did not solve the lever problem 

lever %>% 
  group_by(indiv) %>% 
    summarize(min_time = min(tstart))
## # A tibble: 65 × 2
##    indiv      min_time
##    <chr>         <dbl>
##  1 2730-71136        0
##  2 2730-71137        0
##  3 2780-73204        0
##  4 2780-73248        0
##  5 2780-73253        0
##  6 2780-73254        0
##  7 2780-73255        0
##  8 2780-73256        0
##  9 2780-73288        0
## 10 2780-73289        0
## # … with 55 more rows
#all the birds start at 0 
#this summary table will be used to for our survival analysis so there is 1 row per bird with the time-to-event (pulling the lever) as variable "time"
lever2 = lever %>% 
    group_by(status, indiv, habitat,age) %>% 
      summarize (time = max(tstop), 
                 contacts = sum(contacts), 
                 domscore = unique(domscore), 
                 urbanscore = unique(urbanscore), 
                 explscore = unique(explscore), 
                 solve = max(solve))
## `summarise()` has grouped output by 'status', 'indiv', 'habitat'. You can
## override using the `.groups` argument.
#1st km curve for all birds in trial (no grouping)                                       
lever2.km = survfit(Surv(lever2$time,lever2$status) ~ 1)

#km table 
summary(lever2.km)
## Call: survfit(formula = Surv(lever2$time, lever2$status) ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    60     65       9    0.862  0.0428        0.782        0.950
##   120     55       5    0.783  0.0513        0.689        0.891
##   180     49       2    0.751  0.0540        0.653        0.865
##   240     47       4    0.687  0.0581        0.582        0.811
##   300     42       4    0.622  0.0611        0.513        0.754
##   360     38       1    0.605  0.0616        0.496        0.739
##   420     37       1    0.589  0.0621        0.479        0.724
##   540     35       1    0.572  0.0626        0.462        0.709
##   900     34       1    0.555  0.0629        0.445        0.694
##  1020     33       2    0.522  0.0635        0.411        0.662
##  1080     31       1    0.505  0.0636        0.394        0.646
##  1320     30       1    0.488  0.0637        0.378        0.630
##  3660     25       1    0.469  0.0641        0.358        0.613
##  3960     21       1    0.446  0.0648        0.336        0.593
##  4020     16       1    0.418  0.0665        0.306        0.571
##  5400     14       1    0.389  0.0681        0.276        0.548
##  6780     13       1    0.359  0.0691        0.246        0.523
##  6960     12       1    0.329  0.0695        0.217        0.498
print(lever2.km)
## Call: survfit(formula = Surv(lever2$time, lever2$status) ~ 1)
## 
##       n events median 0.95LCL 0.95UCL
## [1,] 65     38   1320     360    6960
#transforming y to be 1 -S(t) for better interpretability given type of event (solving a problem)
ggsurvplot(fit = lever2.km, data = lever2, risk.table = F, fun = "event", conf.int = T, surv.median.line = "hv") +
    labs(title = "Survival Analysis of Chickadees Solving Lever Problem",
       y = "Probability of Pulling Lever", 
       x = "Time (seconds)")

Binning bird characteristic variables for survival analysis

#binning domscore: 
lever2 = lever2 %>%  mutate(dom_bin = case_when(
                                          domscore <= .50 ~ "low", 
                                          domscore > .50 ~ "high"))

#distribution of explscore: 
ggplot(data = lever2, aes(x = explscore)) + geom_histogram(binwidth = 3)

#there is quite the outlier! That's one intrepid bird.

#binning explscore: 
lever2 = lever2 %>% mutate(expl_bin = case_when(
                                explscore <= -.89 ~ "low", 
                                explscore > -.89 ~ "high"))

Age

#creating new km table, grouping by type of habitat and age of bird: 
lever2.km2 = survfit(Surv(lever2$time,lever2$status) ~ lever2$age) 

ggsurvplot(fit = lever2.km2, data = lever2, risk.table = F, conf.int = F, fun = "event", linetype = "strata", legend = "right", palette="Paired") + 
  labs(title = "Adults are better at Lever Pulling Task",
       y = "Probability of Pulling Lever", 
       x = "Time (seconds)")

#log rank test
survdiff(Surv(lever2$time,lever2$status) ~ lever2$age)
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$age)
## 
## n=58, 7 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$age=adult    33       23     17.2      1.97      4.35
## lever2$age=juvenile 25       11     16.8      2.01      4.35
## 
##  Chisq= 4.3  on 1 degrees of freedom, p= 0.04
# p = .04 

Dominance Score affects bird’s task performance

#km table, grouping by how dominant and exploratory the birds are: 
lever2.km3 = survfit(Surv(lever2$time,lever2$status) ~ strata(lever2$dom_bin))

ggsurvplot(fit = lever2.km3, data = lever2, risk.table = F, conf.int = F, fun = "event", linetype = "strata", legend = "right", palette="Paired") + 
  labs(title = "The More Dominant are Better at Pulling Lever",
       y = "Probability of Pulling Lever", 
       x = "Time (seconds)")

#log rank test 
survdiff(Surv(lever2$time,lever2$status) ~ lever2$dom_bin) #p = .09
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$dom_bin)
## 
## n=39, 26 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$dom_bin=high 22       15       11      1.41      2.94
## lever2$dom_bin=low  17        8       12      1.31      2.94
## 
##  Chisq= 2.9  on 1 degrees of freedom, p= 0.09
survdiff(Surv(lever2$time,lever2$status) ~ lever2$domscore) #p = .05
## Call:
## survdiff(formula = Surv(lever2$time, lever2$status) ~ lever2$domscore)
## 
## n=39, 26 observations deleted due to missingness.
## 
##                             N Observed Expected (O-E)^2/E (O-E)^2/V
## lever2$domscore=0.071428571 1        0    0.670  6.70e-01  7.29e-01
## lever2$domscore=0.125       1        0    0.670  6.70e-01  7.29e-01
## lever2$domscore=0.142857143 2        2    1.290  3.90e-01  4.38e-01
## lever2$domscore=0.166666667 1        0    0.742  7.42e-01  8.06e-01
## lever2$domscore=0.25        2        1    1.246  4.85e-02  5.42e-02
## lever2$domscore=0.285714286 1        0    1.162  1.16e+00  1.30e+00
## lever2$domscore=0.333333333 2        1    1.027  7.36e-04  8.15e-04
## lever2$domscore=0.357142857 1        0    0.670  6.70e-01  7.29e-01
## lever2$domscore=0.375       1        0    1.162  1.16e+00  1.30e+00
## lever2$domscore=0.428571429 1        1    0.996  2.02e-05  2.23e-05
## lever2$domscore=0.444444444 1        0    1.162  1.16e+00  1.30e+00
## lever2$domscore=0.5         3        3    1.156  2.94e+00  3.31e+00
## lever2$domscore=0.571428571 2        1    0.374  1.05e+00  1.21e+00
## lever2$domscore=0.625       1        1    0.617  2.37e-01  2.58e-01
## lever2$domscore=0.642857143 1        1    0.567  3.30e-01  3.61e-01
## lever2$domscore=0.666666667 2        0    1.904  1.90e+00  2.17e+00
## lever2$domscore=0.714285714 1        1    0.128  5.93e+00  6.80e+00
## lever2$domscore=0.75        2        2    0.787  1.87e+00  2.09e+00
## lever2$domscore=0.785714286 1        1    0.433  7.42e-01  8.14e-01
## lever2$domscore=0.833333333 1        0    0.742  7.42e-01  8.06e-01
## lever2$domscore=0.857142857 1        0    1.162  1.16e+00  1.30e+00
## lever2$domscore=0.875       1        1    0.187  3.53e+00  3.95e+00
## lever2$domscore=0.928571429 2        2    0.981  1.06e+00  1.17e+00
## lever2$domscore=1           7        5    3.164  1.06e+00  1.37e+00
## 
##  Chisq= 35.5  on 23 degrees of freedom, p= 0.05
#This large p values are probably due to the very small sample size!

Initial Findings:

From the survival curves above, the data suggests adult, rural birds with a high dominance and exploration scores are the most likely to complete the lever pulling task. It appears that age and dominance play a large role in determining likelihood in solving the problem.

#preparing lever2 dataset for fitdistcens: 
lever3 = lever2 %>% mutate(left= ifelse(time == 0,time+.001,time), 
                  right= ifelse(status == 1,left, NA))

#weibull
weibullFit.lever3=fitdistcens(data.frame(lever3[,c("left","right")]), dist = "weibull")
summary(weibullFit.lever3)$estimate
##        shape        scale 
##    0.4676573 5041.4391445
plot(weibullFit.lever3)

#make weibull act expo using fix.arg shape = 1 ?
#exponential 
#expFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "exp") #exponential does not work #summary(expFit.lever3)
#plot(expFit.lever3)

#lognnormal 
logFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "lnorm")
summary(logFit.lever3)$estimate
##  meanlog    sdlog 
## 7.536097 2.600340
plot(logFit.lever3)

#assessing goodness of fits
weibullFit.lever3$aic
## [1] 674.9868
logFit.lever3$aic
## [1] 665.9616
#gammaFit.lever3 = fitdistcens(data.frame(lever3[,c("left","right")]), dist = "gamma")
#gamma doesn't work either! 
   
#our lognormal distribution is the best fit 

cdfcompcens(list(logFit.lever3,weibullFit.lever3),
            legendtext=c("Lognormal","Weibull"),
            xlab="Time to Pull Lever (seconds)",
            plotstyle = "ggplot")

qqcompcens(list(logFit.lever3,weibullFit.lever3), 
           legendtext=c("Lognormal","Weibull")
           ,plotstyle = "ggplot")

Fitting parametric lognormal curve

meanlog.lever = summary(logFit.lever3)$estimate[1]
sdlog.lever = summary(logFit.lever3)$estimate[2]

lnorm_values = rlnorm(1000, meanlog = meanlog.lever, sdlog = sdlog.lever)

modelfit.lever = tibble(time = pmin(8000,lnorm_values)) %>%                        
                      mutate(status = ifelse(time >= 8000, 0,1), dataType = "model") 

observed.lever = lever2 %>% ungroup %>% dplyr::select(time, status) %>% mutate(dataType = "observed") 

dist.lever = bind_rows(modelfit.lever,observed.lever)

km.lever =  survfit(Surv(dist.lever$time,dist.lever$status) ~ strata(dist.lever$dataType))

ggsurvplot(fit = km.lever, data = dist.lever, risk.table = F, conf.int = T
           , fun = "event"
           , linetype = "strata"
           , palette="Dark2"
           , legend = "right"
           , surv.median.line = 'hv') + 
  labs(title = "Lognormal Fit of Chickadee Lever Pulling Probability")

#What percent of birds pulled the lever within 1800 second (a half hour?)
plnorm(1800, meanlog = meanlog.lever, sdlog = sdlog.lever)
## [1] 0.4937783
#.49

#at what time would half have the birds have pulled the lever? 
qlnorm(0.5, meanlog = meanlog.lever, sdlog = sdlog.lever, lower.tail = TRUE )
## [1] 1874.5
#1874.5 

Weibull Regression

Weibull regression modeling 100 birds each for 2 different levels of dominance

survObject=Surv(lever3$left,lever3$right,type="interval2")

weibullMod = survreg(survObject~lever3$domscore, dist = "weibull")
summary(weibullMod)
## 
## Call:
## survreg(formula = survObject ~ lever3$domscore, dist = "weibull")
##                  Value Std. Error     z       p
## (Intercept)     10.267      0.999 10.28 < 2e-16
## lever3$domscore -3.085      1.360 -2.27 0.02331
## Log(scale)       0.584      0.175  3.33 0.00087
## 
## Scale= 1.79 
## 
## Weibull distribution
## Loglik(model)= -202.7   Loglik(intercept only)= -205.4
##  Chisq= 5.43 on 1 degrees of freedom, p= 0.02 
## Number of Newton-Raphson Iterations: 5 
## n=39 (26 observations deleted due to missingness)
b0 =  10.267
b1 = -3.085 

#make random dominance scores for birds: 
random_numbers <- function(n, min, max) {
  random_values <- runif(n, min = min, max = max)
  return(random_values)
}

lows = random_numbers(1000,.01,.5)
highs = random_numbers(1000,.51,.99)

sigma  = summary(weibullMod)$scale

#choose domscores of .25 and .75 to simulate 
low_dom_birds = rweibull(1000, shape = 1/sigma, scale=exp(b0 + b1*lows))
#dom50_rural_adult = rweibull(100, shape = 1/sigma, scale=exp(b0 + b1*.50))
high_dom_birds = rweibull(1000, shape = 1/sigma, scale=exp(b0 + b1*highs))

modelfit = tibble(domscore = c(lows,highs), 
                  time = pmin(8000,c(low_dom_birds,high_dom_birds))) %>%                        
                      mutate(status = ifelse(time == 8000, 0,1), dataType = "model") %>% 
                        mutate(dom_bin = case_when( 
                                  domscore <= .50 ~ "low", 
                                  domscore > .50 ~ "high"))

lever4 = lever3 %>% dplyr::select(domscore,dom_bin,time, status) 
## Adding missing grouping variables: `indiv`, `habitat`
lever4 = lever4 %>% ungroup %>% dplyr::select(-indiv, -habitat) %>% 
  mutate(dataType = "observed")

together = bind_rows(lever4, modelfit)

km7 = survfit(Surv(together$time, together$status) ~ strata(together$dom_bin) + strata(together$dataType))


ggsurvplot(fit = km7, data = together, fun = "event", conf.int = F, legend = "right", palette="Paired")+ 
  labs(title = "Weibull Regression of Chickadee Lever Pulling Probability")

Diagnostics

#diagnostics 
devresids=residuals(weibullMod,type="deviance")

#lever3 %>%
    #mutate(devresids=devresids) %>%
    #ggplot(aes(y=devresids,x=age,fill=age)) +
       # geom_boxplot() +
        #geom_hline(yintercept = 0,linetype="dotted") +
        #labs(title="Deviance Residuals for logNormal Model", x="age",y="Deviance Residuals") +
        #coord_flip() +
        #theme_minimal()

#plotting residuals in throwing error due to presence of NAs, I believe. 


#these models may not be most appropriate for this dataset

Cox proportional hazards model: Explanation of use of cox proportional hazards regression from study:

“When analysing the predictors of problem-solving performance, we conducted an extended cox proportional hazards regression. This is a semi-parametric survival analysis approach that makes no assumptions about the distribution of the response variable [49]. We used this approach as an alternative to assigning capped latencies to individuals that did not solve a problem, which would suggest that the bird solved at that time and may not allow the data to meet assumptions of normality for commonly used regression analyses. Using censored observations allows us to avoid those assumptions and incorporate our limited knowledge of eachbird’s performance on a task (e.g. we only know that an unsuccessful bird was not able to solvea task in the time given)”

In other words, given the quantity of birds that did not complete the task and are therefore censored, using a semi-parametric regression model

cox_mod = coxph(Surv(time, status) ~ domscore + explscore + habitat + age, data = lever3)
summary(cox_mod)
## Call:
## coxph(formula = Surv(time, status) ~ domscore + explscore + habitat + 
##     age, data = lever3)
## 
##   n= 35, number of events= 20 
##    (30 observations deleted due to missingness)
## 
##                 coef exp(coef) se(coef)      z Pr(>|z|)  
## domscore      1.8330    6.2527   0.9292  1.973   0.0485 *
## explscore     0.1845    1.2027   0.1018  1.813   0.0698 .
## habitaturban -0.8638    0.4216   0.5757 -1.500   0.1335  
## agejuvenile  -0.3058    0.7365   0.5657 -0.541   0.5888  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## domscore        6.2527     0.1599    1.0118    38.639
## explscore       1.2027     0.8315    0.9852     1.468
## habitaturban    0.4216     2.3722    0.1364     1.303
## agejuvenile     0.7365     1.3577    0.2430     2.232
## 
## Concordance= 0.706  (se = 0.077 )
## Likelihood ratio test= 10.56  on 4 df,   p=0.03
## Wald test            = 10.38  on 4 df,   p=0.03
## Score (logrank) test = 11.11  on 4 df,   p=0.03
ggsurvplot(survfit(cox_mod), fun = "event", data = lever3, surv.median.line = "hv", palette="Paired")+ 
  labs(title = "Cox Mod of Chickadee Lever Pulling Probability")

#this plot makes sense in comparing the median value of 
median(lever3$time)
## [1] 1020

DATASET 2: Paper

indiv: individual band number identification. latency: latency to solve or become censored in the paper-ripping trials. status: censoring status of latency to solve the paper-ripping task used for cox regression (0 = censored, 1 = uncensored). tstart: start time (in seconds) of each time interval used for the extended cox regression. tstop: stop time (in seconds) of each time interval used for the extended cox regression. contacts: number of contacts made with the task in each time interval. domscore: dominance score based on interactions between birds at RFID feeders. urbanscore: urbanization score generated by PCA analysis used to infer degree of urbanization at each site. explscore: exploration score generated by PCA analysis. solve: indicator of whether the task was solved in each time interval (0 = unsolved, 1 = solved). site: code of each field site. habitat: rural (urban score < 0) or urban (urban score > 0) habitat type.

Getting acquainted with the Paper trial dataset

paper
## # A tibble: 3,431 × 12
##    indiv latency status tstart tstop conta…¹ domsc…² urban…³ expls…⁴ solve site 
##    <chr>   <dbl>  <dbl>  <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl> <chr>
##  1 2780…      45      1      0    60      17   0.833    1.96  -0.728     1 FRP  
##  2 2780…    3751      0      0    60       0   0.5      1.96  -1.58      0 FRP  
##  3 2780…    3751      0     60   120       3   0.5      1.96  -1.58      0 FRP  
##  4 2780…    3751      0    120   180       0   0.5      1.96  -1.58      0 FRP  
##  5 2780…    3751      0    180   240       3   0.5      1.96  -1.58      0 FRP  
##  6 2780…    3751      0    240   300       2   0.5      1.96  -1.58      0 FRP  
##  7 2780…    3751      0    300   360       3   0.5      1.96  -1.58      0 FRP  
##  8 2780…    3751      0    360   420       0   0.5      1.96  -1.58      0 FRP  
##  9 2780…    3751      0    420   480       0   0.5      1.96  -1.58      0 FRP  
## 10 2780…    3751      0    480   540       0   0.5      1.96  -1.58      0 FRP  
## # … with 3,421 more rows, 1 more variable: habitat <chr>, and abbreviated
## #   variable names ¹​contacts, ²​domscore, ³​urbanscore, ⁴​explscore
skim(paper)
Data summary
Name paper
Number of rows 3431
Number of columns 12
_______________________
Column type frequency:
character 3
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
indiv 0 1 10 10 0 60 0
site 0 1 2 3 0 7 0
habitat 0 1 5 5 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latency 0 1.00 5702.94 2064.27 18.00 3751.00 7079.00 7373.00 8098.00 ▁▁▅▂▇
status 0 1.00 0.19 0.39 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tstart 0 1.00 2834.16 2036.76 0.00 1140.00 2520.00 4260.00 8040.00 ▇▇▅▃▂
tstop 0 1.00 2894.16 2036.76 60.00 1200.00 2580.00 4320.00 8100.00 ▇▇▅▃▂
contacts 0 1.00 0.26 1.81 0.00 0.00 0.00 0.00 50.00 ▇▁▁▁▁
domscore 1676 0.51 0.57 0.30 0.07 0.33 0.50 0.88 1.00 ▅▅▅▃▇
urbanscore 0 1.00 -0.47 1.76 -2.00 -1.89 -1.62 1.96 2.48 ▇▁▂▁▃
explscore 0 1.00 0.07 2.53 -2.07 -1.70 -1.11 0.84 9.21 ▇▂▁▁▁
solve 0 1.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
paper2 = paper %>% group_by(status, indiv, habitat) %>% summarize (time = max(tstop), contacts = sum(contacts), domscore = unique(domscore), urbanscore = unique(urbanscore), explscore = unique(explscore), solve = max(solve))
## `summarise()` has grouped output by 'status', 'indiv'. You can override using
## the `.groups` argument.
#general km curve all birds (no strata):

paper2.km = survfit(Surv(paper2$time, paper2$status) ~1)

ggsurvplot(fit = paper2.km, data = paper2, risk.table = F, conf.int = T, fun = "event", surv.median.line = "hv", palette = 'Paired') + 
  labs(title = "KM Curve of Chickadee Paper Shredding Probability")

Parametric fits

paper3 = paper2 %>% mutate(left= ifelse(time == 0,time+.001,time), 
                  right= ifelse(status == 1,left, NA))

#weibull
weibullFit.paper3=fitdistcens(data.frame(paper3[,c("left","right")]), dist = "weibull")
summary(weibullFit.paper3)$estimate
##        shape        scale 
## 4.896717e-01 1.151936e+04
plot(weibullFit.paper3)

#exponential - does not work! Should coerce weibull to act like exp with fixed arg shape = 1
#expFit.paper3 = fitdistcens(data.frame(paper3[,c("left","right")]), dist = "exp") 
#summary(expFit.paper3)
#plot(expFit.paper3)

#lognnormal 
logFit.paper3 = fitdistcens(data.frame(paper3[,c("left","right")]), dist = "lnorm")
summary(logFit.paper3)$estimate
##  meanlog    sdlog 
## 8.545151 2.782355
plot(logFit.paper3)

weibullFit.paper3$aic
## [1] 551.7791
logFit.paper3$aic
## [1] 546.7332
#log wins again! 

Plotting both non-parametric and parametric curves

#getting my values from lnorm fit: 

meanlog = summary(logFit.paper3)$estimate[1]
sdlog = summary(logFit.paper3)$estimate[2]

#generating values from lnorm distribution
lnorm_values = rlnorm(1000, meanlog = meanlog, sdlog = sdlog)

modelfit.paper = tibble(time = pmin(8000,lnorm_values)) %>%                        
                      mutate(status = ifelse(time == 8000, 0,1), dataType = "model") 

observed.paper = paper2 %>% ungroup %>% dplyr::select(time, status) %>% mutate(dataType = "observed") 

dist.paper = bind_rows(modelfit.paper,observed.paper)

km.paper =  survfit(Surv(dist.paper$time,dist.paper$status) ~ strata(dist.paper$dataType))

ggsurvplot(fit = km.paper, data = dist.paper, risk.table = F, conf.int = T
           , fun = "event"
           , palette = 'Dark2'
           , linetype = "strata"
           , legend = "right"
           , surv.median.line = 'hv') + 
  labs(title = "Lognormal Fit of Chickadee Paper Ripping Probability", y = "Probability of Ripping Paper")

qlnorm(.50,meanlog = meanlog, sdlog = sdlog)
## [1] 5141.76

Dominance characteristic did not have a huge affect on ability to solve problem:

paper2 = paper2 %>%  mutate(dom_bin = case_when(domscore <= .5 ~ "low", 
                                                domscore > .5 ~ "high"))

paper2.km2 = survfit(Surv(paper2$time,paper2$status) ~ paper2$dom_bin)

ggsurvplot(fit = paper2.km2, data = paper2, risk.table = F, conf.int = F, fun = "event", palette = 'Paired') + 
  labs(title = "Does more dominant mean better at paper test?", y = "Probability of Ripping Paper") 

#log rank test 
survdiff(Surv(paper2$time,paper2$status) ~ paper2$dom_bin)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$dom_bin)
## 
## n=36, 24 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$dom_bin=high 20       14     9.42      2.23      4.42
## paper2$dom_bin=low  16        6    10.58      1.98      4.42
## 
##  Chisq= 4.4  on 1 degrees of freedom, p= 0.04
paper2 %>% group_by(dom_bin, status) %>% count()
## # A tibble: 6 × 3
## # Groups:   dom_bin, status [6]
##   dom_bin status     n
##   <chr>    <dbl> <int>
## 1 high         0     6
## 2 high         1    14
## 3 low          0    10
## 4 low          1     6
## 5 <NA>         0    15
## 6 <NA>         1     9
#there are not enough observations in each group to create parametric curve on how dominance affects ability to rip paper to get at tasty worms

looking at expl_score

paper2 %>% ggplot() + geom_histogram(aes(x = explscore))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

med_explscore = median(paper2$explscore)

#binning expl_score 
paper2 = paper2 %>% mutate(expl_bin = case_when(
                                explscore <= med_explscore ~ "low", 
                                explscore > med_explscore ~ "high")) %>% 
        left_join(pr)
## Joining, by = c("indiv", "habitat", "domscore", "urbanscore", "explscore")
paper2 %>% group_by(expl_bin, status) %>% count()
## # A tibble: 4 × 3
## # Groups:   expl_bin, status [4]
##   expl_bin status     n
##   <chr>     <dbl> <int>
## 1 high          0    14
## 2 high          1    16
## 3 low           0    17
## 4 low           1    13
#there are enough non-censored observation 

#examining km curves stratified by expl_bin: 
paper2.km3 = survfit(Surv(paper2$time,paper2$status) ~ strata(paper2$expl_bin))

ggsurvplot(fit = paper2.km3, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') + 
  labs(title = "Paper ripping by explorer score", y = "Probability of Ripping Paper")

#pretty similar km curves....

survdiff(Surv(time,status) ~  expl_bin, data = paper2)
## Call:
## survdiff(formula = Surv(time, status) ~ expl_bin, data = paper2)
## 
##                N Observed Expected (O-E)^2/E (O-E)^2/V
## expl_bin=high 30       16     13.4     0.495     0.946
## expl_bin=low  30       13     15.6     0.426     0.946
## 
##  Chisq= 0.9  on 1 degrees of freedom, p= 0.3
# not statistically significant difference between high and low exploration scores 

How does habitat affect task completion?

#urban or rural: 
paper2.km3 = survfit(Surv(paper2$time,paper2$status) ~  strata(paper2$habitat))

ggsurvplot(fit = paper2.km3, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') + 
  labs(title = "Paper Shredding: Rural vs. Urban Birds")

#also very similar curves 
survdiff(Surv(paper2$time,paper2$status) ~  paper2$habitat)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$habitat)
## 
##                       N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$habitat=rural 30       18     16.6     0.112     0.295
## paper2$habitat=urban 30       11     12.4     0.151     0.295
## 
##  Chisq= 0.3  on 1 degrees of freedom, p= 0.6

How about age?

paper2.km4 = survfit(Surv(paper2$time,paper2$status) ~  strata(paper2$age))

ggsurvplot(fit = paper2.km4, data = paper2, risk.table = F, conf.int = F, fun = "event", legend = "right", palette = 'Paired') + 
  labs(title = "Paper Shredding: Adult vs. Juvenile Birds")

#also very similar curves 
survdiff(Surv(paper2$time,paper2$status) ~  paper2$age)
## Call:
## survdiff(formula = Surv(paper2$time, paper2$status) ~ paper2$age)
## 
## n=56, 4 observations deleted due to missingness.
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## paper2$age=adult    31       15       15  0.000110  0.000266
## paper2$age=juvenile 25       11       11  0.000149  0.000266
## 
##  Chisq= 0  on 1 degrees of freedom, p= 1