The Question:
For the Challenger dataset, propose a Bayesian model and provide a suitable analysis of the data. Use Markov chain Monte Carlo (MCMC) to simulate from the posterior.
Cleaning the environment:
rm(list=ls())
Importing the Challenger dataset
dataset=read.csv("challenger.csv")
print(dataset)
## Temperature Failure
## 1 53 1
## 2 57 1
## 3 58 1
## 4 63 1
## 5 66 0
## 6 67 0
## 7 67 0
## 8 67 0
## 9 68 0
## 10 69 0
## 11 70 0
## 12 70 0
## 13 70 1
## 14 70 1
## 15 72 0
## 16 73 0
## 17 75 0
## 18 75 1
## 19 76 0
## 20 76 0
## 21 78 0
## 22 79 0
## 23 81 0
The Challenger dataset has 23 rows, each corresponding to one space-shuttle launch, and two variables per launch: the temperature at the time of launch and a binary variable indicating whether there was an O-ring failure. I have named the columns “Temperature” and “Failure” respectively.
History of the dataset: The Challenger disaster occurred on 28 January 1986, when the NASA Space Shuttle orbiter Challenger broke apart 73 seconds into its flight, killing its seven crew members. The Rogers Commission determined that the disintegration began with the failure of an O-ring seal in the solid rocket motor, caused by the unusually cold temperature (about -0.6 degrees Celsius) at launch. The problem with the O-rings was already known: the night before the launch there was a three-hour teleconference between motor engineers and NASA management to discuss the effect of the low temperatures forecast for the launch on O-ring performance.
With this data analysis we will assess whether the temperature at launch has an impact on O-ring failure.
Plotting the dataset
library(ggplot2)
ggplot(dataset, aes(x = Temperature, y = Failure, color = as.factor(Failure))) + geom_point()
In the plot, 1 indicates at least one O-ring failure and 0 indicates no failure. The plot suggests that most O-ring failures occur at low temperatures. We formalize this idea by fitting a logistic regression to the data: since the response variable is binary, a logistic regression is a natural choice.
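As a quick numerical check of this impression (a small addition of my own), the mean launch temperature can be compared between the two groups; the failed launches occur at a noticeably lower average temperature:
# Mean launch temperature for launches without (0) and with (1) an O-ring failure
tapply(dataset$Temperature, dataset$Failure, mean)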
There are two approaches to fitting a logistic regression:
1. Classical logistic regression: this involves maximizing the likelihood function (finding the MLE). There is no closed-form solution for the MLE, so it is found by iterative methods. This gives a single point estimate of β0 and β1. The model we fit is:
logit(p) = β0 + β1*x
y ~ Bernoulli(p)
(where logit(p) = log(p/(1-p)))
2. Bayesian logistic regression: here we start with a prior belief p(β0,β1) about the coefficients. Then p(β0,β1|x,y) ∝ p(y|β0,β1,x) p(β0,β1); that is, the posterior, our updated belief about the coefficients given the evidence, is proportional to the prior (our initial belief) times the likelihood. The posterior has no closed form, but it can be approximated by sampling or by variational methods (this is where MCMC sampling comes into play). This gives us a whole distribution over the coefficients rather than a single point estimate. A small numerical sketch of the logit link used by both approaches follows.
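A minimal sketch of the inverse-logit (logistic) link, using hypothetical coefficient values chosen only for illustration; in R, plogis() and qlogis() implement the inverse logit and the logit respectively:
# Inverse-logit link: maps a linear predictor to a probability in (0, 1)
invlogit <- function(eta) exp(eta) / (1 + exp(eta))   # equivalent to plogis(eta)

# Hypothetical coefficients, not the fitted values
b0 <- 15; b1 <- -0.25
invlogit(b0 + b1 * 55)   # implied failure probability at 55 degrees F
invlogit(b0 + b1 * 75)   # implied failure probability at 75 degrees F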
Splitting the dataset into training and test set:
set.seed(111)
dt = sort(sample(nrow(dataset), nrow(dataset)*.5))
train<-dataset[dt,]
test<-dataset[-dt,]
print(test)
## Temperature Failure
## 2 57 1
## 4 63 1
## 6 67 0
## 7 67 0
## 9 68 0
## 11 70 0
## 12 70 0
## 16 73 0
## 17 75 0
## 18 75 1
## 21 78 0
## 22 79 0
print(train)
## Temperature Failure
## 1 53 1
## 3 58 1
## 5 66 0
## 8 67 0
## 10 69 0
## 13 70 1
## 14 70 1
## 15 72 0
## 19 76 0
## 20 76 0
## 23 81 0
First, fitting the classical logistic regression model:
classic = glm(Failure ~ Temperature, data = train, family = binomial)
summary(classic)
##
## Call:
## glm(formula = Failure ~ Temperature, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1769 -0.7585 -0.3994 0.3923 1.6189
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 16.4162 11.2998 1.453 0.146
## Temperature -0.2487 0.1647 -1.510 0.131
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 14.4206 on 10 degrees of freedom
## Residual deviance: 9.6597 on 9 degrees of freedom
## AIC: 13.66
##
## Number of Fisher Scoring iterations: 5
The coefficients estimated by classical logistic regression are β0 = 16.4162 and β1 = -0.2487. It took 5 Fisher scoring iterations to reach this result.
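As an aside (my own interpretation of the fitted slope, not part of the original output), the temperature coefficient can be read as an odds ratio per degree Fahrenheit:
# Odds ratio for a one-degree increase in launch temperature
exp(coef(classic)["Temperature"])   # exp(-0.2487), roughly 0.78
# i.e. each additional degree multiplies the odds of an O-ring failure by about 0.78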
Considering the accuracy score, sensitivity, specificity and predictive values on the test set:
library(epiR)
## Loading required package: survival
## Package epiR 1.0-15 is loaded
## Type help(epi.about) for summary information
## Type browseVignettes(package = 'epiR') to learn how to use epiR for applied epidemiological analyses
##
prob = predict(classic,test,type="response")
prob
## 2 4 6 7 9 11 12
## 0.90356934 0.67809706 0.43783535 0.43783535 0.37784624 0.26968999 0.26968999
## 16 17 18 21 22
## 0.14900325 0.09622108 0.09622108 0.04805411 0.03787225
pred = rep(0, dim(test)[1])
pred[prob > .5] =1
pred
## [1] 1 1 0 0 0 0 0 0 0 0 0 0
conf_matrix=table(pred,test$Failure)
print(conf_matrix)
##
## pred 0 1
## 0 9 1
## 1 0 2
accuracy_score=mean(pred == test$Failure)
print("Accuracy Score:")
## [1] "Accuracy Score:"
print(accuracy_score)
## [1] 0.9166667
epi.tests(conf_matrix)
## Outcome + Outcome - Total
## Test + 9 1 10
## Test - 0 2 2
## Total 9 3 12
##
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence 0.83 (0.52, 0.98)
## True prevalence 0.75 (0.43, 0.95)
## Sensitivity 1.00 (0.66, 1.00)
## Specificity 0.67 (0.09, 0.99)
## Positive predictive value 0.90 (0.55, 1.00)
## Negative predictive value 1.00 (0.16, 1.00)
## Positive likelihood ratio 3.00 (0.61, 14.86)
## Negative likelihood ratio 0.00 (0.00, NaN)
## ---------------------------------------------------------
The accuracy score, sensitivity and specificity are given above.
Sensitivity: the proportion of actual “No failure” launches that are correctly predicted. Since all such points are correctly predicted, it equals 1.
Specificity: the proportion of actual “At least one failure” launches that are correctly predicted. Since one such point is incorrectly predicted, it equals 67%.
(Because table() orders the levels 0 then 1, epi.tests treats “No failure” as the positive outcome here, which is why sensitivity refers to the no-failure class.)
In this dataset it is particularly important to predict “Failure” correctly from the temperature, so that the necessary precautions can be taken.
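Since the failure class is what matters most, the recall for failures can also be computed directly from the predictions above (a small check of my own):
# Recall for the failure class: fraction of true failures predicted as failures
failure_recall <- sum(pred == 1 & test$Failure == 1) / sum(test$Failure == 1)
failure_recall   # 2 of the 3 failures in the test set are caught, about 0.67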
Now consider the analysis in the Bayesian paradigm:
We choose the prior based on past experience and subjective beliefs. Two types of priors are considered.
1. Uniform prior
I am using the Metropolis-Hastings algorithm (here a random-walk Metropolis sampler, since the proposal is symmetric) to obtain MCMC samples from the posterior distribution. I use a uniform prior with a very small lower bound and a very large upper bound, since we have no prior information and the prior should therefore be effectively flat.
library(Rlab) # provides dbern(), the Bernoulli density
## Rlab 2.15.1 attached.
##
## Attaching package: 'Rlab'
## The following object is masked from 'package:survival':
##
## cancer
## The following objects are masked from 'package:stats':
##
## dexp, dgamma, dweibull, pexp, pgamma, pweibull, qexp, qgamma,
## qweibull, rexp, rgamma, rweibull
## The following object is masked from 'package:datasets':
##
## precip
## Log-likelihood of the logistic model evaluated on the training data
loglikelihood <- function(param){
  b0 <- param[1]
  b1 <- param[2]
  pred = b0 + b1 * train$Temperature
  # Bernoulli log-likelihood with success probability from the inverse-logit link
  return(sum(dbern(train$Failure, exp(pred) / (1 + exp(pred)), log = T)))
}
## Prior distribution: independent flat Uniform(-10^6, 10^6) priors on b0 and b1
prior <- function(param){
  b0 = param[1]
  b1 = param[2]
  b0prior = dunif(b0, -10^6, 10^6, log = T)
  b1prior = dunif(b1, -10^6, 10^6, log = T)
  return(b0prior + b1prior)
}
## Log-posterior (up to an additive constant): log-likelihood + log-prior
posterior <- function(param){
  return(loglikelihood(param) + prior(param))
}
## MCMC: random-walk Metropolis sampler
## Symmetric Normal proposal centred at the current value
proposalfunction <- function(param){
  return(rnorm(2, mean = param, sd = c(0.1, 0.1)))
}
metropolis_MCMC <- function(startvalue, iterations){
  chain <- array(dim = c(iterations + 1, 2))
  chain[1, ] <- startvalue
  for (i in 1:iterations){
    proposal = proposalfunction(chain[i, ])
    # Acceptance probability: posterior ratio (the symmetric proposal cancels)
    probab = exp(posterior(proposal) - posterior(chain[i, ]))
    if (runif(1) < probab){
      chain[i + 1, ] = proposal
    } else {
      chain[i + 1, ] = chain[i, ]
    }
  }
  return(chain)
}
startvalue = c(24,0)
chain = metropolis_MCMC(startvalue, 25000)
burnIn =15000
par(mfrow = c(2,2))
hist(chain[-(1:burnIn),1], main="Posterior of b0" )
abline(v = mean(chain[-(1:burnIn),1]))
hist(chain[-(1:burnIn),2], main="Posterior of b1")
abline(v = mean(chain[-(1:burnIn),2]))
plot(chain[-(1:burnIn),1], type = "l", main = "Chain values of b0")
plot(chain[-(1:burnIn),2], type = "l", main = "Chain values of b1")
u1=mean(chain[-(1:burnIn),1])
v1=mean(chain[-(1:burnIn),2])
print("The estimates are:")
## [1] "The estimates are:"
print(u1)
## [1] 22.44542
print(v1)
## [1] -0.3382461
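Before relying on these point estimates it is worth checking two simple diagnostics (my own addition; the effective sample size calculation assumes the coda package is installed): the acceptance rate of the random-walk sampler and the effective sample size of the retained draws.
# Proportion of accepted proposals (a repeated row means the proposal was rejected)
acceptance_rate <- 1 - mean(duplicated(chain[-(1:burnIn), ]))
acceptance_rate

# Effective sample size of the post-burn-in draws (requires the coda package)
library(coda)
effectiveSize(as.mcmc(chain[-(1:burnIn), ]))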
## Predicted failure probabilities for the test set, plugging in the posterior means
thisfunc1 <- function(x){
  return(exp(u1 + v1 * x) / (1 + exp(u1 + v1 * x)))
}
bp <- unlist(lapply(test$Temperature, FUN = thisfunc1))
bp
## [1] 0.95951116 0.75692997 0.44594629 0.44594629 0.36463352 0.22586631
## [7] 0.22586631 0.09564838 0.05102670 0.05102670 0.01911898 0.01370742
pred = rep(0, dim(test)[1])
pred[bp > .5] =1
pred
## [1] 1 1 0 0 0 0 0 0 0 0 0 0
conf_matrix=table(pred,test$Failure)
print(conf_matrix)
##
## pred 0 1
## 0 9 1
## 1 0 2
accuracy_score=mean(pred == test$Failure)
print("Accuracy Score:")
## [1] "Accuracy Score:"
print(accuracy_score)
## [1] 0.9166667
epi.tests(conf_matrix)
## Outcome + Outcome - Total
## Test + 9 1 10
## Test - 0 2 2
## Total 9 3 12
##
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence 0.83 (0.52, 0.98)
## True prevalence 0.75 (0.43, 0.95)
## Sensitivity 1.00 (0.66, 1.00)
## Specificity 0.67 (0.09, 0.99)
## Positive predictive value 0.90 (0.55, 1.00)
## Negative predictive value 1.00 (0.16, 1.00)
## Positive likelihood ratio 3.00 (0.61, 14.86)
## Negative likelihood ratio 0.00 (0.00, NaN)
## ---------------------------------------------------------
Thus, based on the uniform prior, we have obtained samples from the posterior distribution and used the means of the post-burn-in samples as estimates of b0 and b1.
The estimates are approximately 22.45 and -0.34.
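The point estimates do not use the full posterior; equal-tailed 95% credible intervals can be read off the same samples (a small addition of my own):
# Equal-tailed 95% credible intervals from the post-burn-in samples
quantile(chain[-(1:burnIn), 1], probs = c(0.025, 0.975))   # b0
quantile(chain[-(1:burnIn), 2], probs = c(0.025, 0.975))   # b1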
2. Normal prior
I am again using the same random-walk Metropolis-Hastings algorithm to obtain MCMC samples from the posterior distribution, now with a Normal prior with a very large variance, since we have no prior information and the prior should remain very diffuse.
loglikelihood <- function(param){
  b0 <- param[1]
  b1 <- param[2]
  pred = b0 + b1 * train$Temperature
  return(sum(dbern(train$Failure, exp(pred) / (1 + exp(pred)), log = T)))
}
## Prior distribution: independent Normal(0, sd = 100) priors on b0 and b1
prior <- function(param){
  b0 = param[1]
  b1 = param[2]
  b0prior = dnorm(b0, 0, 100, log = T)
  b1prior = dnorm(b1, 0, 100, log = T)
  return(b0prior + b1prior)
}
posterior <- function(param){
  return(loglikelihood(param) + prior(param))
}
## MCMC
proposalfunction <- function(param){
  return(rnorm(2, mean = param, sd = c(0.1, 0.1)))
}
metropolis_MCMC <- function(startvalue, iterations){
  chain <- array(dim = c(iterations + 1, 2))
  chain[1, ] <- startvalue
  for (i in 1:iterations){
    proposal = proposalfunction(chain[i, ])
    probab = exp(posterior(proposal) - posterior(chain[i, ]))
    if (runif(1) < probab){
      chain[i + 1, ] = proposal
    } else {
      chain[i + 1, ] = chain[i, ]
    }
  }
  return(chain)
}
startvalue = c(20,-0.1)
chain = metropolis_MCMC(startvalue, 25000)
burnIn = 15000
par(mfrow = c(2,2))
hist(chain[-(1:burnIn),1], main="Posterior of b0" )
abline(v = mean(chain[-(1:burnIn),1]))
hist(chain[-(1:burnIn),2], main="Posterior of b1")
abline(v = mean(chain[-(1:burnIn),2]))
plot(chain[-(1:burnIn),1], type = "l", main = "Chain values of b0")
plot(chain[-(1:burnIn),2], type = "l", main = "Chain values of b1")
u=mean(chain[-(1:burnIn),1])
v=mean(chain[-(1:burnIn),2])
print("The estimates are:")
## [1] "The estimates are:"
print(u)
## [1] 15.34754
print(v)
## [1] -0.235188
Thus, based on the Normal prior, we have obtained samples from the posterior distribution and used the means of the post-burn-in samples as estimates of b0 and b1.
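The original question is whether temperature has an impact on the O-rings; with posterior samples in hand, the posterior probability that the slope is negative can be estimated directly (my own addition, computed from the same chain):
# Posterior probability that the temperature coefficient is negative
mean(chain[-(1:burnIn), 2] < 0)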
Considering the accuracy and the other scores for the Bayesian logistic regression (Normal prior):
## Predicted failure probabilities for the test set, using the posterior means under the Normal prior
thisfunc <- function(x){
  return(exp(u + v * x) / (1 + exp(u + v * x)))
}
bp <- unlist(lapply(test$Temperature, FUN = thisfunc))
bp
## [1] 0.87455193 0.62964444 0.39889767 0.39889767 0.34406112 0.24682383
## [7] 0.24682383 0.13929134 0.09182401 0.09182401 0.04755577 0.03796755
pred = rep(0, dim(test)[1])
pred[bp > .5] =1
pred
## [1] 1 1 0 0 0 0 0 0 0 0 0 0
conf_matrix=table(pred,test$Failure)
print(conf_matrix)
##
## pred 0 1
## 0 9 1
## 1 0 2
accuracy_score=mean(pred == test$Failure)
print("Accuracy Score:")
## [1] "Accuracy Score:"
print(accuracy_score)
## [1] 0.9166667
epi.tests(conf_matrix)
## Outcome + Outcome - Total
## Test + 9 1 10
## Test - 0 2 2
## Total 9 3 12
##
## Point estimates and 95 % CIs:
## ---------------------------------------------------------
## Apparent prevalence 0.83 (0.52, 0.98)
## True prevalence 0.75 (0.43, 0.95)
## Sensitivity 1.00 (0.66, 1.00)
## Specificity 0.67 (0.09, 0.99)
## Positive predictive value 0.90 (0.55, 1.00)
## Negative predictive value 1.00 (0.16, 1.00)
## Positive likelihood ratio 3.00 (0.61, 14.86)
## Negative likelihood ratio 0.00 (0.00, NaN)
## ---------------------------------------------------------
We obtain a similar accuracy score (and the same confusion matrix) with the Bayesian logistic regression as with the classical fit, while additionally obtaining a full posterior distribution over the coefficients. Both analyses point to a higher probability of O-ring failure at lower launch temperatures.
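Finally, since the history section notes that the launch took place at about -0.6 degrees Celsius (roughly 31 degrees Fahrenheit), the posterior samples can be used to estimate the failure probability at that temperature. This is well below the observed temperature range, so it is an extrapolation; the sketch below is my own illustrative addition using the Normal-prior chain:
# Posterior distribution of the failure probability at 31 degrees F (an extrapolation)
post <- chain[-(1:burnIn), ]
p31 <- exp(post[, 1] + post[, 2] * 31) / (1 + exp(post[, 1] + post[, 2] * 31))
mean(p31)                                # posterior mean failure probability
quantile(p31, probs = c(0.025, 0.975))   # 95% credible interval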