Measurements have been taken on breeding pairs of bird species on 16 islands British over several decades. For each species, the data set contains: TIME, the average extinction time in the islands where they live, NESTING, the average number of nesting pairs, SIZE, the size of the species (large or small) and STATUS, the migratory status of the species (migrant or resident). The objective is to fit a model that describes the variation in time of extinction of bird species in terms of the covariates NESTING, SIZE and STATUS. The data set is available in the LearnBayes package from R.
Since two of the covariates are categorical with two levels, they can be represented by binary variables. In the birdextinct.txt data set, SIZE is encoded as 0-1 for large-small, while STATUS is encoded as 0-1 for migrant-resident.
After loading the LearnBayes library and loading the data into R, I built some initial graphics. Since the variable TIME is strongly asymmetric to the right,I transform it logarithmically creating the variable LOGTIME. Thus,I draw the LOGTIME variable against the three predictor variables. Given the the categorical variables SIZE and STATUS take only two values, I use the function jitter of R so that we can see any point overlap. Note that there is a positive relationship between the mean number of nesting pairs and the extinction time. However, there are 5 particular species whose points appear to vary from the general pattern. The labeling of these points is get with the dynamic function identify that labels the points as they are selected with the mouse(although in this post you can’t see it. To see it use the code in R or Rstudio) .
library(LearnBayes)
head(birdextinct,20)
## species time nesting size status
## 1 Sparrowhawk 3.030 1.000 0 1
## 2 Buzzard 5.464 2.000 0 1
## 3 Kestrel 4.098 1.210 0 1
## 4 Peregrine 1.681 1.125 0 1
## 5 Grey_partridge 8.850 5.167 0 1
## 6 Quail 1.493 1.000 0 0
## 7 Red-legged_partridge 7.692 2.750 0 1
## 8 Pheasant 3.846 5.630 0 1
## 9 Water_rail 16.667 3.000 0 1
## 10 Corncrake 4.219 4.670 0 0
## 11 Moorhen 8.130 4.056 0 1
## 12 Coot 5.000 1.000 0 1
## 13 Lapwing 7.299 6.960 0 0
## 14 Golden_plover 1.000 1.670 0 0
## 15 Ringed_plover 27.027 5.560 0 1
## 16 Curlew 3.106 2.830 0 0
## 17 Redshank 4.000 4.375 0 0
## 18 Snipe 16.129 4.125 0 0
## 19 Stock_dove 3.484 3.670 0 1
## 20 Rock_dove 37.037 8.330 0 1
data<-(birdextinct)
attach(birdextinct)
plot(nesting,time)
logtime=log(time)
plot(nesting,logtime)
#identify(nesting,logtime,label=species,n=5)
plot(jitter(size),logtime,xaxp=c(0,1,1));plot(jitter(status),logtime,xaxp=c(0,1,1))
I Select the most suitable linear model and analyze it. Based on the information obtained, I consider the regression model E (logTIMEi | x, Tetha) = Betha0 + Betha1NESTINGi + Betha2SIZEi + Betha3STATUSi First, I perform a classic least squares fit using the lm command
fit=lm(logtime~nesting+size+status,data=birdextinct,x=TRUE,y=TRUE)
summary(fit)
##
## Call:
## lm(formula = logtime ~ nesting + size + status, data = birdextinct,
## x = TRUE, y = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8410 -0.2932 -0.0709 0.2165 2.5167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43087 0.20706 2.081 0.041870 *
## nesting 0.26501 0.03679 7.203 1.33e-09 ***
## size -0.65220 0.16667 -3.913 0.000242 ***
## status 0.50417 0.18263 2.761 0.007712 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6524 on 58 degrees of freedom
## Multiple R-squared: 0.5982, Adjusted R-squared: 0.5775
## F-statistic: 28.79 on 3 and 58 DF, p-value: 1.577e-11
I observe in the output that NESTING exerts a strong influence, the species with a large number of nesting pairs tend to have longer extinction, which means that these species are less likely to suffer extinction. The effects of SIZE and STATUS seem to be less significant, the big birds have shorter extinction times and residents have extinction times longer. Next, I use the blinreg function to generate samples from the distribution joint a posteriori of Betha and Sigma. The inputs of this function are the vector of values of the response variable, y, the design matrix of the linear regression fit, X, and the number of simulations, m. Let us observe that in the lm function the optional arguments x = T RUE, y = T RUE so that the design matrix and the response vector are available as components of the fit structure.
help(blinreg)
theta.sample=blinreg(fit$y,fit$x,5000)
The algorithm implemented in blinreg is based on the decomposition of the posterior joint distribution [Betha, Sigma^2| y] as the product [Sigma^2 | y] · [Betha|Sigma^ 2, y]. To simulate the pair (Sigma^2,Betha) is first obtained Sigma^2 from the inverse gamma of parameters ((n - k) / 2, S / 2)
S=sum(fit$residual^2)
shape=fit$df.residual/2
rate=S/2
sigma2=rigamma(1,shape,rate)
Then the regression vector Betha is simulated from the multivariate normal with mean and variance-covariance matrix VsubbethaSigma^2. Matrix Vsubbetha is obtained by dividing the variance-covariance matrix vcov of the least squares fit, between the root mean square error stored in variable MSE.
MSE=sum(fit$residuals^2)/fit$df.residual
vbeta=vcov(fit)/MSE
beta=rmnorm(1,mean=fit$coef,varcov=vbeta*sigma2)
The blinreg function returns two components, beta is a matrix of the simulations of posterior marginal distribution of Betha, and sigma is a vector of the simulations of the posterior marginal distribution of Sigma.
With the following R commands, the histograms of the coefficients are constructed Betha1, Betha2, Betha3 and the standard deviation simulated a posteriori.
par(mfrow=c(2,2))
hist(theta.sample$beta[,2],main='NESTIN',xlab=expression(beta[1]))
hist(theta.sample$beta[,3],main='SIZE',xlab=expression(beta[2]))
hist(theta.sample$beta[,4],main='STATUS',xlab=expression(beta[3]))
hist(theta.sample$sigma,main='ERROR SD',xlab=expression(sigma))
I can then calculate for each individual parameter its 5th percentiles, 50 and 95. For the output, we use the apply and quantile commands to summarize the values, from the simulations array of Betha, theta.sample$beta. In an analogous way proceed with simulations of Sigma.
apply(theta.sample$beta,2,quantile,c(.05,.5,.95))
## X(Intercept) Xnesting Xsize Xstatus
## 5% 0.07811657 0.2032353 -0.9332340 0.2025097
## 50% 0.43128326 0.2648177 -0.6538573 0.5102384
## 95% 0.76988374 0.3255634 -0.3766327 0.8188802
As expected, the a posteriori medians of the regression parameters they are similar in value to the estimates of the ordinary regression. This is due to have applied a distribution to pripori “vague” for Betha. Any of the small differences between a posteriori medians and least squares estimates they are due to small errors inherent to the simulation. Suppose I am interested in estimating the mean extinction time, E (y | x’) = x’Betha, for four nesting pairs and four combinations of SIZE and STATUS. The values of the four sets of covariates are shown in the following table
| Set of covariates | Nesting pairs | SIZE | STATUS |
|---|---|---|---|
| A | 4 | small | migrant |
| B | 4 | small | resident |
| C | 4 | large | migrant |
| D | 4 | large | resident |
The following R statements define the four sets of covariates and are stored in matrix X1. The blinregexpected function returns a simulated sample of the expected response E (y | x’) = x’Betha for each set of covariates. The inputs are the matrix values of the covariates X1 and the list of simulated values of Betha and Sigma obtained from the binlinreg function. The output is a matrix whose columns contain the simulations for a given set of covariates. Are built In addition, the histograms of the simulations for each of the mean times of extinction.
cov1=c(1,4,0,0)
cov2=c(1,4,1,0)
cov3=c(1,4,0,1)
cov4=c(1,4,1,1)
X1=rbind(cov1,cov2,cov3,cov4)
mean.draws=blinregexpected(X1,theta.sample)
par(mfrow=c(2,2))
hist(mean.draws[,1],main="Joint covariance A",xlab="log TIME")
hist(mean.draws[,2],main="Joint covariance B",xlab="log TIME")
hist(mean.draws[,3],main="Joint covariance C",xlab="log TIME")
hist(mean.draws[,4],main="Joint covariance D",xlab="log TIME")
Suppose I am now interested in predicting a future answer ˜y for a given vector of covariates x’. The procedure is analogous to the previous study. using in this case the blinregpred function
cov1=c(1,4,0,0)
cov2=c(1,4,1,0)
cov3=c(1,4,0,1)
cov4=c(1,4,1,1)
X1=rbind(cov1,cov2,cov3,cov4)
mean.draws=blinregpred(X1,theta.sample)
par(mfrow=c(2,2))
hist(mean.draws[,1],main="Joint covariance A",xlab="log TIME")
hist(mean.draws[,2],main="Joint covariance B",xlab="log TIME")
hist(mean.draws[,3],main="Joint covariance C",xlab="log TIME")
hist(mean.draws[,4],main="Joint covariance D",xlab="log TIME")
Comparing the histograms of the two studies, it is observed that the distributions predictive values are substantially wider than the distributions of the responses socks. Finally, I check whether the observations are consistent with the fitted model. For this I rely on the Bayesian residuals Ei = yi- xiBetha. I calculate the posterior probabilities of the outliers P (| Ei |> k|y) for all observations and a constant value k. These probabilities are calculated with the function bayesresiduals. the inputs are the fit structure of the lm fit, the matrix of simulated parameters theta.sample and the value k. The output is a vector of probabilities a posteriori of outliers. In this case, I use a cutoff value k = 2. I use the plot command to plot these probabilities against the NESTING covariate. Using identify we can identify four birds that have probabilities 0.4 or greater. These birds have extinction times that are not well explained by the variables NESTING, SIZE and STATUS.
prob.out=bayesresiduals(fit,theta.sample,2)
par(mfrow=c(1,1))
plot(nesting,prob.out)
#identify(nesting,prob.out,label=species,n=4) #No useful in this post