Bayesian Logistic Regression

Author

Federico Durante

Bayesian logistic regression: concrete compressive strength

This document describes a Bayesian Logistic Regression Analysis to predict whether concrete has high compressive strength (compressive_strength_class = 1) or low compressive strength (compressive_strength_class = 0). The dataset contains 1030 observations, eight independent variables and one dependent variable. Below are the specifications of the variables:

  • cement: quantitative, measured in kg per m³ mixture

  • blast_furnace_slag: quantitative, measured in kg per m³ mixture

  • fly_ash: quantitative, measured in kg per m³ mixture

  • water: quantitative, measured in kg per m³ mixture

  • superplasticizer: quantitative, measured in kg per m³ mixture

  • coarse_aggregate: quantitative, measured in kg per m³ mixture

  • fine_aggregate: quantitative, measured in kg per m³ mixture

  • age: quantitative, measured in days (1~365)

  • compressive_strength_class: binary, 1 for high compressive strength, 0 for low compressive strength (dependent variable).

First, we load the necessary libraries and read the dataset.

library(tidyverse)
Warning: il pacchetto 'tidyverse' è stato creato con R versione 4.3.2
Warning: il pacchetto 'ggplot2' è stato creato con R versione 4.3.3
Warning: il pacchetto 'tidyr' è stato creato con R versione 4.3.3
Warning: il pacchetto 'purrr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'dplyr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'stringr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'forcats' è stato creato con R versione 4.3.2
Warning: il pacchetto 'lubridate' è stato creato con R versione 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(R2jags)
Warning: il pacchetto 'R2jags' è stato creato con R versione 4.3.3
Caricamento del pacchetto richiesto: rjags
Warning: il pacchetto 'rjags' è stato creato con R versione 4.3.3
Caricamento del pacchetto richiesto: coda
Warning: il pacchetto 'coda' è stato creato con R versione 4.3.3
Linked to JAGS 4.3.1
Loaded modules: basemod,bugs

Caricamento pacchetto: 'R2jags'

Il seguente oggetto è mascherato da 'package:coda':

    traceplot
library(coda)
library(mcmcplots)
Warning: il pacchetto 'mcmcplots' è stato creato con R versione 4.3.3
Registered S3 method overwritten by 'mcmcplots':
  method        from  
  as.mcmc.rjags R2jags
df = read.csv("C:\\Users\\Federico Durante\\Downloads\\concreteX.csv")

Next, we create boxplots to visualize the distribution of the input variables based on the compressive strength class.

par(mfrow=c(3,3),mar=c(3,3,1,1),mgp=c(1.75,.75,0),oma=c(0.33,0,.25,0))
g=factor(df$compressive_strength_class,c(0,1),c("low strength","high strength"))
boxplot(df$cement~g,range=0,boxwex=0.5,ylab="cement",xlab="",col=c("blue","red"))
boxplot(df$water~g,range=0,boxwex=0.5,ylab="water",xlab="",col=c("blue","red"))
boxplot(df$age~g,range=0,boxwex=0.5,ylab="age",xlab="",col=c("blue","red"))
boxplot(df$fly_ash~g,range=0,boxwex=0.5,ylab="fly_ash",xlab="",col=c("blue","red"))
boxplot(df$superplasticizer~g,range=0,boxwex=0.5,ylab="superplasticizer",xlab="",col=c("blue","red"))
boxplot(df$blast_furnace_slag~g,range=0,boxwex=0.5,ylab="blast_furnace_slug",xlab="",col=c("blue","red"))
boxplot(df$coarse_aggregate~g,range=0,boxwex=0.5,ylab="coarse_aggregate",xlab="",col=c("blue","red"))
boxplot(df$fine_aggregate~g,range=0,boxwex=0.5,ylab="fine_aggregate",xlab="",col=c("blue","red"))

  1. Cement:

    • There is a noticeable difference in the distribution of cement content between the two groups. The “high strength” group tends to have a higher median cement content compared to the “low strength” group.
  2. Water:

    • The water content shows less distinct differences between the two groups. However, there is a slightly lower median water content in the “high strength” group.
  3. Age:

    • Age shows a marked difference. The “high strength” group tends to have a higher age distribution, suggesting that older concrete is less likely be in the “low strength” class.
  4. Fly Ash:

    • The distribution of fly ash content is similar between the two groups, with no significant difference in the median values.
  5. Superplasticizer:

    • The superplasticizer content shows a higher median and generally higher values in the “high strength” group, indicating that superplasticizer may contribute to concrete strength.
  6. Blast Furnace Slag:

    • The distribution of blast furnace slag content is similar between the two groups, with slight differences in the interquartile range but not in the median values.
  7. Coarse Aggregate:

    • There is a slightly higher median coarse aggregate content in the “high strength” class, but the overall distribution is quite similar between the two groups.
  8. Fine Aggregate:

    • Fine aggregate content shows no significant difference between the two groups in terms of median values or overall distribution.

These boxplots suggest that certain properties like cement content, age, and superplasticizer content may have a stronger association with whether concrete is high strength or low strength, while others like fly ash and fine aggregate content do not show as clear a relationship.

Model specification and fitting

We set up the design matrix and define the Bayesian logistic regression model.

# Design matrix creation
X = model.matrix(compressive_strength_class ~ cement + blast_furnace_slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age, data=df)

# Define dimensions
n <- nrow(X)
p <- ncol(X)
y <- df$compressive_strength_class

# JAGS model definition
# For each observation i, the binary outcome y[i] is modeled using a Bernoulli distribution with          probability theta[i].
# The coefficients beta are assigned a normal prior distribution with mean 0 and precision pre_beta[j].
model_logit <- function() {
  for (i in 1:n) {
    y[i] ~ dbern(theta[i])
    logit(theta[i]) <- beta[1]*X[i,1] + beta[2]*X[i,2] + beta[3]*X[i,3] + beta[4]*X[i,4] + beta[5]*X[i,5] + beta[6]*X[i,6] + beta[7]*X[i,7] + beta[8]*X[i,8] + beta[9]*X[i,9]
  }
  for (j in 1:9) {
    beta[j] ~ dnorm(0, pre_beta[j])
  }
}

# Define priors
# pre_beta is a vector of prior precision values (very small, implying vague/flat priors).
pre_beta <- rep(0.0001, 9)
data <- list("n", "y", "X", "pre_beta")
params <- c("beta")
inits1 <- list(beta=rep(0, 9))
inits <- list(inits1)

# MCMC Settings and Model Fitting:
G <- 5000
n.burnin <- 1000
n.iter <- n.burnin + G
n.thin <- 5
out_logit <- jags(data=data, inits, params, model.file=model_logit, n.iter=n.iter, n.burnin=n.burnin, n.thin=n.thin, n.chains=1, digits=8)
module glm loaded
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 1030
   Unobserved stochastic nodes: 9
   Total graph size: 13830

Initializing model
out_logit <- update(out_logit, n.iter=n.iter, n.burnin=n.burnin, n.thin=n.thin)

# Posterior Analysis
# Extract the posterior samples of the beta coefficients from the BUGSoutput.
mcmcplot(as.mcmc(out_logit))

                                                                                
Preparing plots for deviance.  10% complete.

                                                                                
Preparing plots for beta.  20% complete.

                                                                                
Preparing plots for beta.  30% complete.

                                                                                
Preparing plots for beta.  40% complete.

                                                                                
Preparing plots for beta.  50% complete.

                                                                                
Preparing plots for beta.  60% complete.

                                                                                
Preparing plots for beta.  70% complete.

                                                                                
Preparing plots for beta.  80% complete.

                                                                                
Preparing plots for beta.  90% complete.

                                                                                
Preparing plots for beta.  100% complete.
beta_post <- out_logit$BUGSoutput$sims.array[,1,1:9]
colnames(beta_post)
[1] "beta[1]" "beta[2]" "beta[3]" "beta[4]" "beta[5]" "beta[6]" "beta[7]"
[8] "beta[8]" "beta[9]"

Posterior analysis

We analyze the posterior distributions of the regression coefficients.

  • Plot Layout: The 3x3 grid allows for simultaneous visualization of the posterior distributions for all 9 beta coefficients.

  • Visual Comparison: Each subplot includes both the posterior density (histogram and smoothed line) and the prior density (dotted line), allowing for direct comparison.

  • Credible Intervals: The annotated vertical lines highlight key quantiles (2.5th, median, and 97.5th) of the posterior, providing insights into the spread and central tendency of the coefficient estimates.

  • Prior Information: The prior density and its quantiles are displayed for context, emphasizing the influence (or lack thereof) of the prior on the posterior estimates.

par(mfrow=c(3,3),mar=c(3,3,1,1),mgp=c(1.75,.75,0))
for (j in 1:9) {
  chain <- beta_post[,j]
  prior_quant <- c(qnorm(0.025, 0, sd=sqrt(1/pre_beta)), qnorm(0.975, 0, sd=sqrt(1/pre_beta)))
  post_quant <- quantile(chain, prob=c(0.025, 0.5, 0.975))
  cat("The posterior median of beta[", j, "] is ", post_quant[2],
      "\nThe 95% posterior credible interval for beta[", j, "] is (",
      post_quant[1], ",", post_quant[3], ")\n", sep="")
  post_dens <- approxfun(density(chain, adj=2))
  hist(chain, freq=F, col="gray", main=paste("Posterior density of beta", j, sep=""))
  lines(density(chain, adj=2), col="black", lwd=3, lty=1)
  curve(dnorm(x, 0, sd=sqrt(1/pre_beta)), from=range(chain)[1], to=range(chain)[2], add=T, lty=2, lwd=2)
  segments(post_quant[1], 0, post_quant[1], post_dens(post_quant[1]), col="black", lty=2, lwd=3)
  segments(post_quant[3], 0, post_quant[3], post_dens(post_quant[3]), col="black", lty=2, lwd=3)
  segments(post_quant[2], 0, post_quant[2], post_dens(post_quant[2]), col="black", lty=1, lwd=3)
  segments(prior_quant[1], 0, prior_quant[1], dnorm(prior_quant[1], 0, sd=sqrt(1/pre_beta)), col="black", lty=3, lwd=2)
  segments(prior_quant[2], 0, prior_quant[2], dnorm(prior_quant[2], 0, sd=sqrt(1/pre_beta)), col="black", lty=3, lwd=2)
}
The posterior median of beta[1] is 4.471387
The 95% posterior credible interval for beta[1] is (-9.900266,18.73107)
The posterior median of beta[2] is 0.01460493
The 95% posterior credible interval for beta[2] is (0.01005109,0.01935517)
The posterior median of beta[3] is 0.009256835
The 95% posterior credible interval for beta[3] is (0.003975943,0.01480429)
The posterior median of beta[4] is 0.006696083
The 95% posterior credible interval for beta[4] is (-0.0001006084,0.01343392)
The posterior median of beta[5] is -0.03708872
The 95% posterior credible interval for beta[5] is (-0.05814178,-0.01522475)
The posterior median of beta[6] is 0.08529011
The 95% posterior credible interval for beta[6] is (0.03335616,0.1409831)
The posterior median of beta[7] is -0.0005341935
The 95% posterior credible interval for beta[7] is (-0.005569376,0.004526586)
The posterior median of beta[8] is -0.005948074
The 95% posterior credible interval for beta[8] is (-0.01177929,-0.0004280096)
The posterior median of beta[9] is 0.04179296
The 95% posterior credible interval for beta[9] is (0.03539886,0.04884458)

Summary

  • Positive Predictors: Cement, blast furnace slag, fly ash, superplasticizer, and age positively influence the likelihood of the concrete being “high strength”, with varying degrees of confidence.

  • Negative Predictors: Water and fine aggregate negatively influence the likelihood of the concrete being in the “high strength” class.

  • Non-significant Predictor: Coarse aggregate appears to have little to no effect on the compressive strength class.

These results highlight the importance of certain ingredients and characteristics (e.g., cement content, age) in improving the compressive strength of concrete, while others (e.g., water content) may weaken it. The credible intervals provide a measure of uncertainty, with narrower intervals indicating more reliable estimates.

Predictions and model evaluation

We compute the predicted probabilities and classify the observations into binary outcomes. The accuracy of the model is then calculated.

X <- model.matrix(compressive_strength_class ~ cement + blast_furnace_slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age, data = df)

# Compute the linear predictor
linear_predictor <- X %*% t(beta_post)

# Compute the predicted probabilities using the logistic function
predicted_probabilities <- apply(linear_predictor, 1, function(x) mean(1 / (1 + exp(-x))))

# Binary predictions based on a threshold of 0.5
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)

# Compare predicted classes to actual classes
confusion_matrix <- table(predicted_classes, df$compressive_strength_class)
print(confusion_matrix)
                 
predicted_classes   0   1
                0 471  97
                1  69 393
# Calculate accuracy
accuracy <- mean(predicted_classes == df$compressive_strength_class)
cat("The prediction accuracy is", accuracy, "\n")
The prediction accuracy is 0.838835 

From this confusion matrix, we can derive the following metrics:

  • True Positives (TP): 393 (predicted 1, actually 1)

  • True Negatives (TN): 471 (predicted 0, actually 0)

  • False Positives (FP): 97 (predicted 1, actually 0)

  • False Negatives (FN): 69 (predicted 0, actually 1)

Using these values, we can calculate several performance metrics:

Summary

  • Accuracy: (TP+TN)/(TP+TN+FP+FN) = 83.88%

  • Precision: TP/(TP+FP) = 80.20%

  • Specificity: TN/(TN+FP) = 82.92%

  • Sensitivity: TP/(TP+FN) = 85.06%