This document describes a Bayesian Logistic Regression Analysis to predict whether concrete has high compressive strength (compressive_strength_class = 1) or low compressive strength (compressive_strength_class = 0). The dataset contains 1030 observations, eight independent variables and one dependent variable. Below are the specifications of the variables:
cement: quantitative, measured in kg per m³ mixture
blast_furnace_slag: quantitative, measured in kg per m³ mixture
fly_ash: quantitative, measured in kg per m³ mixture
water: quantitative, measured in kg per m³ mixture
superplasticizer: quantitative, measured in kg per m³ mixture
coarse_aggregate: quantitative, measured in kg per m³ mixture
fine_aggregate: quantitative, measured in kg per m³ mixture
age: quantitative, measured in days (1~365)
compressive_strength_class: binary, 1 for high compressive strength, 0 for low compressive strength (dependent variable).
First, we load the necessary libraries and read the dataset.
library(tidyverse)
Warning: il pacchetto 'tidyverse' è stato creato con R versione 4.3.2
Warning: il pacchetto 'ggplot2' è stato creato con R versione 4.3.3
Warning: il pacchetto 'tidyr' è stato creato con R versione 4.3.3
Warning: il pacchetto 'purrr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'dplyr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'stringr' è stato creato con R versione 4.3.2
Warning: il pacchetto 'forcats' è stato creato con R versione 4.3.2
Warning: il pacchetto 'lubridate' è stato creato con R versione 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(R2jags)
Warning: il pacchetto 'R2jags' è stato creato con R versione 4.3.3
Caricamento del pacchetto richiesto: rjags
Warning: il pacchetto 'rjags' è stato creato con R versione 4.3.3
Caricamento del pacchetto richiesto: coda
Warning: il pacchetto 'coda' è stato creato con R versione 4.3.3
Linked to JAGS 4.3.1
Loaded modules: basemod,bugs
Caricamento pacchetto: 'R2jags'
Il seguente oggetto è mascherato da 'package:coda':
traceplot
library(coda)library(mcmcplots)
Warning: il pacchetto 'mcmcplots' è stato creato con R versione 4.3.3
Registered S3 method overwritten by 'mcmcplots':
method from
as.mcmc.rjags R2jags
There is a noticeable difference in the distribution of cement content between the two groups. The “high strength” group tends to have a higher median cement content compared to the “low strength” group.
Water:
The water content shows less distinct differences between the two groups. However, there is a slightly lower median water content in the “high strength” group.
Age:
Age shows a marked difference. The “high strength” group tends to have a higher age distribution, suggesting that older concrete is less likely be in the “low strength” class.
Fly Ash:
The distribution of fly ash content is similar between the two groups, with no significant difference in the median values.
Superplasticizer:
The superplasticizer content shows a higher median and generally higher values in the “high strength” group, indicating that superplasticizer may contribute to concrete strength.
Blast Furnace Slag:
The distribution of blast furnace slag content is similar between the two groups, with slight differences in the interquartile range but not in the median values.
Coarse Aggregate:
There is a slightly higher median coarse aggregate content in the “high strength” class, but the overall distribution is quite similar between the two groups.
Fine Aggregate:
Fine aggregate content shows no significant difference between the two groups in terms of median values or overall distribution.
These boxplots suggest that certain properties like cement content, age, and superplasticizer content may have a stronger association with whether concrete is high strength or low strength, while others like fly ash and fine aggregate content do not show as clear a relationship.
Model specification and fitting
We set up the design matrix and define the Bayesian logistic regression model.
# Design matrix creationX =model.matrix(compressive_strength_class ~ cement + blast_furnace_slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age, data=df)# Define dimensionsn <-nrow(X)p <-ncol(X)y <- df$compressive_strength_class# JAGS model definition# For each observation i, the binary outcome y[i] is modeled using a Bernoulli distribution with probability theta[i].# The coefficients beta are assigned a normal prior distribution with mean 0 and precision pre_beta[j].model_logit <-function() {for (i in1:n) { y[i] ~dbern(theta[i])logit(theta[i]) <- beta[1]*X[i,1] + beta[2]*X[i,2] + beta[3]*X[i,3] + beta[4]*X[i,4] + beta[5]*X[i,5] + beta[6]*X[i,6] + beta[7]*X[i,7] + beta[8]*X[i,8] + beta[9]*X[i,9] }for (j in1:9) { beta[j] ~dnorm(0, pre_beta[j]) }}# Define priors# pre_beta is a vector of prior precision values (very small, implying vague/flat priors).pre_beta <-rep(0.0001, 9)data <-list("n", "y", "X", "pre_beta")params <-c("beta")inits1 <-list(beta=rep(0, 9))inits <-list(inits1)# MCMC Settings and Model Fitting:G <-5000n.burnin <-1000n.iter <- n.burnin + Gn.thin <-5out_logit <-jags(data=data, inits, params, model.file=model_logit, n.iter=n.iter, n.burnin=n.burnin, n.thin=n.thin, n.chains=1, digits=8)
module glm loaded
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1030
Unobserved stochastic nodes: 9
Total graph size: 13830
Initializing model
out_logit <-update(out_logit, n.iter=n.iter, n.burnin=n.burnin, n.thin=n.thin)# Posterior Analysis# Extract the posterior samples of the beta coefficients from the BUGSoutput.mcmcplot(as.mcmc(out_logit))
We analyze the posterior distributions of the regression coefficients.
Plot Layout: The 3x3 grid allows for simultaneous visualization of the posterior distributions for all 9 beta coefficients.
Visual Comparison: Each subplot includes both the posterior density (histogram and smoothed line) and the prior density (dotted line), allowing for direct comparison.
Credible Intervals: The annotated vertical lines highlight key quantiles (2.5th, median, and 97.5th) of the posterior, providing insights into the spread and central tendency of the coefficient estimates.
Prior Information: The prior density and its quantiles are displayed for context, emphasizing the influence (or lack thereof) of the prior on the posterior estimates.
The posterior median of beta[1] is 4.471387
The 95% posterior credible interval for beta[1] is (-9.900266,18.73107)
The posterior median of beta[2] is 0.01460493
The 95% posterior credible interval for beta[2] is (0.01005109,0.01935517)
The posterior median of beta[3] is 0.009256835
The 95% posterior credible interval for beta[3] is (0.003975943,0.01480429)
The posterior median of beta[4] is 0.006696083
The 95% posterior credible interval for beta[4] is (-0.0001006084,0.01343392)
The posterior median of beta[5] is -0.03708872
The 95% posterior credible interval for beta[5] is (-0.05814178,-0.01522475)
The posterior median of beta[6] is 0.08529011
The 95% posterior credible interval for beta[6] is (0.03335616,0.1409831)
The posterior median of beta[7] is -0.0005341935
The 95% posterior credible interval for beta[7] is (-0.005569376,0.004526586)
The posterior median of beta[8] is -0.005948074
The 95% posterior credible interval for beta[8] is (-0.01177929,-0.0004280096)
The posterior median of beta[9] is 0.04179296
The 95% posterior credible interval for beta[9] is (0.03539886,0.04884458)
Summary
Positive Predictors: Cement, blast furnace slag, fly ash, superplasticizer, and age positively influence the likelihood of the concrete being “high strength”, with varying degrees of confidence.
Negative Predictors: Water and fine aggregate negatively influence the likelihood of the concrete being in the “high strength” class.
Non-significant Predictor: Coarse aggregate appears to have little to no effect on the compressive strength class.
These results highlight the importance of certain ingredients and characteristics (e.g., cement content, age) in improving the compressive strength of concrete, while others (e.g., water content) may weaken it. The credible intervals provide a measure of uncertainty, with narrower intervals indicating more reliable estimates.
Predictions and model evaluation
We compute the predicted probabilities and classify the observations into binary outcomes. The accuracy of the model is then calculated.
X <-model.matrix(compressive_strength_class ~ cement + blast_furnace_slag + fly_ash + water + superplasticizer + coarse_aggregate + fine_aggregate + age, data = df)# Compute the linear predictorlinear_predictor <- X %*%t(beta_post)# Compute the predicted probabilities using the logistic functionpredicted_probabilities <-apply(linear_predictor, 1, function(x) mean(1/ (1+exp(-x))))# Binary predictions based on a threshold of 0.5predicted_classes <-ifelse(predicted_probabilities >0.5, 1, 0)# Compare predicted classes to actual classesconfusion_matrix <-table(predicted_classes, df$compressive_strength_class)print(confusion_matrix)