1. Load Required Libraries
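
The chunk that loads packages is not echoed in the rendered output. Judging from the functions called later (preProcess, boxcox, avPlots, pivot_longer, ggplot), the analysis appears to need caret, MASS, car and tidyverse. That list is an inference from the code below, not the author's original chunk. A minimal sketch:

```r
# Packages inferred from the functions used later in this document
# (assumption: the original loading chunk is not shown in the output).
pkgs <- c("caret",      # preProcess()
          "MASS",       # boxcox()
          "car",        # avPlots()
          "tidyverse")  # %>%, pivot_longer(), ggplot()

# Check which packages are installed without stopping on error.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Install first: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[is_installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```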

2. Identifying Key Predictors Influencing Concrete Strength

To determine which mix components significantly affect concrete strength, we examine the data set and rename columns for clarity.

Load and Inspect Data

concrete <- read.csv("C:\\Users\\workd\\Desktop\\concrete.csv")
head(concrete)

##   rownames cement blast_furnace_slag fly_ash water superplasticizer
## 1        1  540.0                0.0       0   162              2.5
## 2        2  540.0                0.0       0   162              2.5
## 3        3  332.5              142.5       0   228              0.0
## 4        4  332.5              142.5       0   228              0.0
## 5        5  198.6              132.4       0   192              0.0
## 6        6  266.0              114.0       0   228              0.0
##   coarse_aggregate fine_aggregate age compressive_strength
## 1           1040.0          676.0  28                79.99
## 2           1055.0          676.0  28                61.89
## 3            932.0          594.0 270                40.27
## 4            932.0          594.0 365                41.05
## 5            978.4          825.5 360                44.30
## 6            932.0          670.0  90                47.03

colnames(concrete)

##  [1] "rownames"             "cement"               "blast_furnace_slag"  
##  [4] "fly_ash"              "water"                "superplasticizer"    
##  [7] "coarse_aggregate"     "fine_aggregate"       "age"                 
## [10] "compressive_strength"

Drop the first and fourth columns (potential redundancy)

concrete <- concrete[-c(1,4)]
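
Dropping columns by position breaks silently if the column order in the CSV ever changes. As a sketch of the same step done by name, here is a small made-up data frame whose column names mirror those printed above (the values are illustrative only):

```r
# Toy data frame with column names taken from the CSV header (values made up).
toy <- data.frame(rownames = 1:2,
                  cement = c(540, 332.5),
                  blast_furnace_slag = c(0, 142.5),
                  fly_ash = c(0, 0),
                  water = c(162, 228))

# Drop by name rather than by position.
drop_cols <- c("rownames", "fly_ash")
toy_kept <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_kept)
```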

Rename Columns for Clarity

# Rename columns for clarity
concrete <- setNames(concrete, c("Cement", "BFS", "Water", "Superplast", "CoarseAggr", "FineAggr", "Age", "Strength"))
colnames(concrete)

## [1] "Cement"     "BFS"        "Water"      "Superplast" "CoarseAggr"
## [6] "FineAggr"   "Age"        "Strength"

Scale Data for Better Analysis

Each variable is rescaled to the [0, 1] range (min-max scaling) so that the coefficients are comparable across predictors.

concrete_trxf <- preProcess(as.data.frame(concrete), method = c("range"))
concrete <- predict(concrete_trxf, as.data.frame(concrete))
concrete_round <- round(concrete, 3)
head(concrete_round)

##   Cement   BFS Water Superplast CoarseAggr FineAggr   Age Strength
## 1  1.000 0.000 0.321      0.078      0.695    0.206 0.074    0.967
## 2  1.000 0.000 0.321      0.078      0.738    0.206 0.074    0.742
## 3  0.526 0.396 0.848      0.000      0.381    0.000 0.739    0.473
## 4  0.526 0.396 0.848      0.000      0.381    0.000 1.000    0.482
## 5  0.221 0.368 0.561      0.000      0.516    0.581 0.986    0.523
## 6  0.374 0.317 0.848      0.000      0.381    0.191 0.245    0.557
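
For reference, the "range" method amounts to min-max scaling of each column. The sketch below reimplements it in base R on the cement values shown in the first head() output. The extra value 102 is an assumed column minimum, chosen only because it reproduces the scaled values printed above; it does not appear in the output shown here.

```r
# Min-max scaling: (x - min) / (max - min), applied column by column.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Cement values from the first head() output, plus an ASSUMED minimum of 102.
toy <- data.frame(cement = c(540, 332.5, 198.6, 266, 102))

scaled <- as.data.frame(lapply(toy, min_max))
round(scaled, 3)
```

Under that assumption the first four results are 1.000, 0.526, 0.221 and 0.374, matching the Cement column of the scaled output.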

attach(concrete)

3. Develop Linear Regression Models: Strength as Function of Mix Components

To quantify relationships between predictors and concrete strength, we use regression models.

Check and Adjust Response Variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2664  0.4001  0.4172  0.5457  1.0000

## [1] 0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.266   1.400   1.417   1.546   2.000
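
The code for this step is not echoed. The three outputs are consistent with summarising Strength, printing its minimum (0 after min-max scaling), and adding 1 so that the response is strictly positive, which the Box-Cox procedure requires. A sketch of that reading, on made-up values rather than the real data:

```r
# Made-up scaled response values in [0, 1] (not the real data).
strength <- c(0, 0.2664, 0.4001, 0.5457, 1)

# A zero minimum rules out log() and Box-Cox on the variable as it stands.
min(strength)

# Shift by a constant so that every value is strictly positive.
strength_shifted <- strength + 1
summary(strength_shifted)
```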

Fit the Linear Model

model <- lm(Strength ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model)

## 
## Call:
## lm(formula = Strength ~ Cement + BFS + Water + Superplast + CoarseAggr + 
##     FineAggr + Age, data = concrete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38497 -0.09019  0.00549  0.08595  0.42865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.52310    0.06136  24.824  < 2e-16 ***
## Cement       0.36906    0.02256  16.357  < 2e-16 ***
## BFS          0.19051    0.02324   8.196 7.39e-16 ***
## Water       -0.50421    0.05044  -9.997  < 2e-16 ***
## Superplast   0.14908    0.03806   3.917 9.56e-05 ***
## CoarseAggr  -0.11786    0.02963  -3.978 7.44e-05 ***
## FineAggr    -0.19142    0.03365  -5.688 1.68e-08 ***
## Age          0.49766    0.02500  19.903  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1325 on 1022 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5944 
## F-statistic: 216.4 on 7 and 1022 DF,  p-value: < 2.2e-16

Perform Box-Cox Transformation

boxcox_result <- boxcox(model)

# Extract the optimal lambda
optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
optimal_lambda

## [1] 0.1010101

From the plot:

The peak of the curve occurs at lambda ≈ 0.10, the value extracted above. This lies between a log transformation (lambda = 0) and a square root transformation (lambda = 0.5), so both the Box-Cox transformation at the optimal lambda and a square root transformation of the response variable (Strength) are tried below.

The 95% confidence interval for lambda does not include lambda = 1. This indicates that a transformation is likely necessary.
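
boxcox() reports the lambda that maximises a profile log-likelihood, and the interval drawn on its plot is the set of lambdas whose log-likelihood lies within a chi-square cutoff of the peak. The sketch below computes both by hand in base R. It uses the built-in mtcars data and a hypothetical helper, boxcox_loglik, purely for illustration; it is not the concrete model.

```r
# Profile log-likelihood of the Box-Cox parameter for a linear model.
boxcox_loglik <- function(lambda, y, X) {
  n <- length(y)
  # Box-Cox transform; log(y) is the limit as lambda -> 0
  y_t <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(lm.fit(X, y_t)$residuals^2)
  # Gaussian log-likelihood (up to a constant) plus the Jacobian term
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg
lambdas <- seq(-2, 2, by = 0.01)
ll <- vapply(lambdas, boxcox_loglik, numeric(1), y = y, X = X)

# Grid value with the highest log-likelihood
lambda_hat <- lambdas[which.max(ll)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the peak
ci <- range(lambdas[ll > max(ll) - qchisq(0.95, df = 1) / 2])
lambda_hat
ci
```

Checking whether 1 falls inside ci is the numeric counterpart of reading the interval off the plot.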

Apply Conditional Transformation: Log vs. Box-Cox Based on Lambda

if (abs(optimal_lambda) < 1e-6) {
  # If lambda is close to 0, use the log transformation
  concrete$Strength_transformed <- log(concrete$Strength)
} else {
  # Otherwise, use the Box-Cox transformation formula
  concrete$Strength_transformed <- (concrete$Strength^optimal_lambda - 1) / optimal_lambda
}
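
For reference, the transformation applied by the code above is the standard one-parameter Box-Cox family, defined for a strictly positive response y:

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[6pt]
\log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

With lambda = 0.5 this gives 2(sqrt(y) - 1), an affine function of sqrt(y), so a model fitted to sqrt(y) has the same t-statistics and R-squared as one fitted to the Box-Cox transform at lambda = 0.5; only the scale of the coefficients differs.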

Re-fit the Model with Transformed Response Variable

model_transformed <- lm(Strength_transformed ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model_transformed)

## 
## Call:
## lm(formula = Strength_transformed ~ Cement + BFS + Water + Superplast + 
##     CoarseAggr + FineAggr + Age, data = concrete)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.305074 -0.072005  0.007782  0.068788  0.269325 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.43628    0.04465   9.770  < 2e-16 ***
## Cement       0.25799    0.01642  15.710  < 2e-16 ***
## BFS          0.12622    0.01692   7.461 1.83e-13 ***
## Water       -0.36427    0.03671  -9.924  < 2e-16 ***
## Superplast   0.10917    0.02770   3.941 8.65e-05 ***
## CoarseAggr  -0.09318    0.02156  -4.322 1.70e-05 ***
## FineAggr    -0.15173    0.02449  -6.195 8.43e-10 ***
## Age          0.36696    0.01820  20.165  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09647 on 1022 degrees of freedom
## Multiple R-squared:  0.5931, Adjusted R-squared:  0.5903 
## F-statistic: 212.8 on 7 and 1022 DF,  p-value: < 2.2e-16

Check Model Diagnostics

par(mfrow = c(2, 2))
plot(model_transformed)

Scale-Location: The red line (smoothed curve) is roughly horizontal and the spread of residuals is consistent across the range of fitted values, so the assumption of constant variance appears to be satisfied.

Residuals vs Fitted: The residuals are randomly scattered around zero with no clear pattern (e.g., a curve or funnel shape) that would suggest non-linearity or heteroscedasticity, so the model assumptions are likely satisfied. Potential outliers or influential points are also labeled in this plot (e.g., observations 225 and 382).

Normal Q-Q Plot: The points fall approximately along the diagonal line, so the residuals are approximately normally distributed. Deviations at the tails (the ends of the line) point to extreme values or outliers.
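
Visual diagnostics can be backed up with numeric checks. The sketch below runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand. It uses the built-in mtcars data as a stand-in, so the numbers say nothing about the concrete model.

```r
# Stand-in model on a built-in dataset (not the concrete data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Normality of residuals: a small p-value suggests non-normality.
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the predictors.
# Under homoscedasticity, n * R^2 is approximately chi-square with
# df = number of predictors (studentized Breusch-Pagan statistic).
aux_data <- data.frame(res_sq = res^2, wt = mtcars$wt, hp = mtcars$hp)
aux <- lm(res_sq ~ wt + hp, data = aux_data)
bp_stat <- nrow(aux_data) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

c(shapiro_p = sw$p.value, bp_p = bp_p)
```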

Apply the Square Root Transformation

concrete$Strength_transformed <- sqrt(concrete$Strength)

# Re-fit the model with the transformed response variable
model_transformed <- lm(Strength_transformed ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model_transformed)

## 
## Call:
## lm(formula = Strength_transformed ~ Cement + BFS + Water + Superplast + 
##     CoarseAggr + FineAggr + Age, data = concrete)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168821 -0.039092  0.003973  0.038407  0.165222 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.236155   0.025619  48.252  < 2e-16 ***
## Cement       0.150964   0.009421  16.024  < 2e-16 ***
## BFS          0.075685   0.009706   7.798 1.54e-14 ***
## Water       -0.210031   0.021059  -9.973  < 2e-16 ***
## Superplast   0.062577   0.015891   3.938 8.78e-05 ***
## CoarseAggr  -0.051694   0.012371  -4.179 3.18e-05 ***
## FineAggr    -0.084050   0.014052  -5.982 3.05e-09 ***
## Age          0.209741   0.010440  20.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05535 on 1022 degrees of freedom
## Multiple R-squared:  0.5957, Adjusted R-squared:  0.5929 
## F-statistic: 215.1 on 7 and 1022 DF,  p-value: < 2.2e-16

Identify Influential Points

# Get influence measures
influence_results <- influence.measures(model_transformed)

# Extract the first 8 rows of the influence measures
head(influence_results$infmat, 8)

##         dfb.1_     dfb.Cmnt      dfb.BFS     dfb.Watr      dfb.Sprp
## 1  0.108140044  0.072724169 -0.050473986 -0.135966477 -0.1304224376
## 2  0.027427815  0.025123043 -0.011933598 -0.036899116 -0.0357200404
## 3 -0.010394101  0.005555136 -0.006982355 -0.009496709 -0.0007360403
## 4 -0.022258157  0.020373200 -0.013560067  0.006423749  0.0013808878
## 5  0.008963693  0.015545514 -0.048767433  0.025128589  0.0278880678
## 6 -0.022853166 -0.001436967  0.004476184  0.063320585  0.0217074181
## 7 -0.026247250  0.004485714  0.019948065  0.007220629  0.0033894993
## 8  0.001237080  0.006083265 -0.004996015  0.011557347  0.0009209081
##       dfb.CrsA     dfb.FnAg      dfb.Age       dffit     cov.r       cook.d
## 1 -0.058242394 -0.107926386 -0.020505709  0.24361271 0.9863850 0.0073934384
## 2 -0.011390874 -0.028740157 -0.006523911  0.07266273 1.0175784 0.0006603655
## 3  0.022160057  0.052344646 -0.156161141 -0.21341412 1.0041628 0.0056840286
## 4  0.041451764  0.084711518 -0.376964176 -0.44231843 0.9853367 0.0243219021
## 5 -0.006559903 -0.042653586 -0.231612646 -0.24883226 1.0230871 0.0077321735
## 6  0.003031744 -0.008769008  0.005734749  0.13127080 0.9912484 0.0021500366
## 7  0.047942833  0.087562618 -0.352449942 -0.41884145 0.9903048 0.0218225757
## 8 -0.004795988 -0.014650135 -0.014206739  0.04252859 1.0158985 0.0002262643
##           hat
## 1 0.013161891
## 2 0.012722228
## 3 0.016904838
## 4 0.028696448
## 5 0.030130833
## 6 0.005936782
## 7 0.028585244
## 8 0.009392049
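
The table lists the raw measures but does not say which observations are unusual. One way to flag them, sketched here on the built-in mtcars data, is to apply common rules of thumb to Cook's distance and leverage. The cutoffs 4/n and 2p/n are conventions, not hard limits, and are not taken from the original analysis.

```r
# Stand-in model on a built-in dataset (not the concrete data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))

cooks <- cooks.distance(fit)
lev <- hatvalues(fit)

# Rule-of-thumb cutoffs (conventions only)
flag_cook <- cooks > 4 / n
flag_lev <- lev > 2 * p / n

# Keep the flagged observations, largest Cook's distance first
flagged <- data.frame(cook.d = cooks, hat = lev)[flag_cook | flag_lev, ]
flagged[order(-flagged$cook.d), ]
```

For the object created above, summary(influence_results) prints only the observations that R itself marks as potentially influential.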

4. Exploratory Data Analysis

Pair Plot to Visualize Relationships

pairs(Strength ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age)

Histogram Distribution of Predictors

df <- concrete
df %>% 
  pivot_longer(cols = colnames(df)) %>% 
  ggplot() + 
  geom_histogram(aes(value), bins = 25) + 
  facet_wrap(~name, scales = 'free') + 
  theme_minimal() + 
  theme(axis.title.y = element_blank())

1. Skewness of Variables

Variables with Skewed Distributions: Age, Cement, Strength, Superplast.

Interpretation:

Age: The right-skewed distribution suggests that most concrete samples are tested at younger ages (e.g., 0–100 days), with fewer samples at older ages. This implies that older concrete samples are underrepresented, which could bias the model if age has a non-linear relationship with strength.

Cement: The left-skewed distribution indicates that most concrete mixes have higher cement content. This suggests that cement is a critical component, and its effect on strength may be non-linear or plateau at higher levels.

Strength: The right-skewed distribution of the original Strength variable indicates that most concrete samples have lower compressive strength, with fewer samples achieving high strength. Linear regression assumes normally distributed residuals rather than a normally distributed response, but a skewed response often leads to skewed residuals, which is why the transformation (Strength_transformed) was applied.

Superplast: The right-skewed distribution suggests that low levels of superplasticizer are more common in the dataset. Superplasticizer is used to improve workability, and its effect on strength may be more pronounced at higher levels.

2. Bimodality of Variables

Variables with Bimodal Distributions: BFS, Water.

Interpretation:

BFS (Blast Furnace Slag): The bimodal distribution suggests that there are two distinct groups of concrete mixes: one with little to no BFS and another with significant amounts of BFS. This could indicate different types of concrete, which may have different strength characteristics.

Water: The bimodal distribution indicates that there are two distinct water content levels in the dataset. This could reflect different mix designs (e.g., high-water vs. low-water mixes), which can significantly affect strength. Higher water content generally reduces strength due to increased porosity, while lower water content improves strength but may reduce workability.

3. Transformation Effect on Strength

Strength_transformed: The transformed Strength variable has a more symmetric distribution than the original.

Interpretation: A more symmetric response tends to give more symmetric residuals and can make the relationship between Strength and the predictors closer to linear, which is what a linear regression model assumes.

This suggests that the original relationship between predictors and strength may have been non-linear. Note, however, that R-squared is essentially unchanged (0.597 for the untransformed model, 0.596 for the square root model), so any gain is in the behaviour of the residuals rather than in explained variance.
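
Skewness can be quantified rather than read off a histogram. The sketch below defines a hypothetical helper, skewness, as the third central moment divided by the cubed standard deviation, and applies it to simulated right-skewed data (not the concrete data) before and after two transformations.

```r
# Sample skewness: third central moment over the cubed standard deviation.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rexp(1000)  # simulated values with a long right tail

sk <- c(raw = skewness(right_skewed),
        sqrt = skewness(sqrt(right_skewed)),
        log = skewness(log(right_skewed)))
round(sk, 2)
```

Values near zero indicate a roughly symmetric distribution; positive values indicate a right tail and negative values a left tail.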

Scatter Plots Against Strength

# Scatter plots to visualize relationships between predictors and strength
concrete %>%
  pivot_longer(cols = -Strength, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = Strength)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  facet_wrap(~predictor, scales = "free") +
  theme_minimal() +
  labs(title = "Scatter Plots of Predictors vs. Strength", x = "Predictor Value", y = "Strength")

## `geom_smooth()` using formula = 'y ~ x'

Histograms of All Variables

# tidyverse supplies %>%, pivot_longer() and ggplot(), which the plots above already rely on
library(tidyverse)

# Create histograms for all variables
concrete %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 25, fill = "blue", color = "black") +
  facet_wrap(~name, scales = "free") +
  theme_minimal() +
  labs(title = "Histograms of Concrete Dataset Variables", x = "Value", y = "Frequency")

Based on the distributions, read together with the regression coefficients reported earlier, the following variables are likely to have the greatest effect on concrete strength:

Cement: The left-skewed distribution indicates that cement is a key component in most concrete mixes. Higher cement content generally leads to higher strength, but the effect may plateau at very high levels.

Water: The bimodal distribution suggests that water content has a significant impact on strength. Lower water content typically results in higher strength, while higher water content reduces strength.

Age: The right-skewed distribution indicates that older concrete samples are underrepresented, but age is known to have a strong positive effect on strength (concrete gains strength over time).

BFS: The bimodal distribution suggests that slag-based concrete (with higher BFS content) may have different strength characteristics compared to ordinary Portland cement.

Check for Multicollinearity Among Variables

pairs(df)
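
A pairs plot shows collinearity only qualitatively. As a numeric complement, the sketch below computes the correlation matrix and variance inflation factors in base R, using the identity that the VIFs are the diagonal of the inverse correlation matrix of the predictors. It runs on the built-in mtcars data, not the concrete data.

```r
# Predictors only, from a built-in dataset (not the concrete data).
X <- mtcars[, c("wt", "hp", "disp", "qsec")]

# Pairwise correlations between predictors
round(cor(X), 2)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the others; this equals the j-th diagonal entry of solve(cor(X)).
vif <- diag(solve(cor(X)))
round(vif, 2)
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic collinearity. The car package, which also provides avPlots(), has a vif() function that can be applied to a fitted model directly.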

5. Assess Model Diagnostics

Standardized Residuals vs. Predictors

par(mfrow = c(2, 4))
rstand <- rstandard(model_transformed)
plot(rstand ~ Cement)
plot(rstand ~ BFS)
plot(rstand ~ Water)
plot(rstand ~ Superplast)
plot(rstand ~ CoarseAggr)
plot(rstand ~ FineAggr)
plot(rstand ~ Age)
plot(rstand ~ model_transformed$fitted.values)

Added Variable Plot for Each Variable

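# avPlots() sets its own panel layout, so par(mfrow = ...) has no effect here;
# use the layout argument instead (7 predictors fit in a 2 x 4 grid)
avPlots(model_transformed, layout = c(2, 4))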
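
An added-variable plot for a predictor plots the residuals of the response against the residuals of that predictor, after both have been regressed on all the other predictors. The sketch below builds one by hand in base R on the built-in mtcars data (not the concrete model) and checks that the slope of the plot equals the coefficient in the full model.

```r
# Full model on a built-in dataset (not the concrete data).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Residualise the response and wt on the remaining predictors.
res_y <- residuals(lm(mpg ~ hp + qsec, data = mtcars))
res_x <- residuals(lm(wt ~ hp + qsec, data = mtcars))

# The added-variable plot for wt
plot(res_x, res_y, xlab = "wt | others", ylab = "mpg | others")
av_fit <- lm(res_y ~ res_x)
abline(av_fit, col = "red")

# The slope of this line equals the wt coefficient in the full model.
av_slope <- unname(coef(av_fit)[2])
all.equal(av_slope, unname(coef(fit)["wt"]))
```

A clear slope in such a plot means the predictor carries information about the response that the other predictors do not.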

Conclusion

Age

Relationship: The scatter plot likely shows a positive trend, where strength increases with age. This is expected because concrete gains strength over time as it cures, and it is consistent with the positive Age coefficient (0.210) in the square root model.

Interpretation: Older concrete samples tend to have higher strength, but the relationship may plateau at very high ages.

Cement

Relationship: The scatter plot likely shows a positive trend, where strength increases with cement content. This is expected because cement is a primary binding agent in concrete, and it is consistent with the positive Cement coefficient (0.151) in the square root model.

Interpretation: Higher cement content generally leads to higher strength, but the effect may plateau at very high cement levels.

CoarseAggr (Coarse Aggregate)

Relationship: The scatter plot may show a weak or flat trend, indicating that coarse aggregate has a limited direct effect on strength. In the square root model its coefficient is small and negative (-0.052), although statistically significant.

Interpretation: Coarse aggregate primarily provides bulk and stability to the concrete mix, and its effect on strength is less pronounced than that of components like cement or water.

FineAggr (Fine Aggregate)

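Relationship: The scatter plot may show a weak or flat trend, similar to coarse aggregate. In the square root model its coefficient is small and negative (-0.084), although statistically significant.

Interpretation: Fine aggregate fills voids and improves workability, and its direct effect on strength is small compared with cement, water, and age.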

Superplast (Superplasticizer)

Relationship: The scatter plot may show a positive trend, where strength increases with superplasticizer content, consistent with the positive Superplast coefficient (0.063) in the square root model.

Interpretation: Superplasticizer improves workability and can enhance strength by allowing for a lower water-cement ratio. However, its effect may be more pronounced at moderate levels.

Water

Relationship: The scatter plot likely shows a negative trend, where strength decreases with higher water content, consistent with the negative Water coefficient (-0.210) in the square root model.

Interpretation: Excess water increases porosity and reduces the density of concrete, leading to lower strength. A lower water-cement ratio generally results in higher strength.

Summary of Key Findings

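Strongest Positive Effects: Age and Cement have the strongest positive relationships with strength (coefficients 0.210 and 0.151 in the square root model).

Strongest Negative Effect: Water has a strong negative relationship with strength (coefficient -0.210).

Moderate Effects: BFS and Superplast show moderate positive effects (coefficients 0.076 and 0.063). BFS has a bimodal distribution, so its relationship with strength may be non-linear.

Weak Effects: CoarseAggr and FineAggr have small negative coefficients (-0.052 and -0.084). Both are statistically significant, but their direct effects on strength are small compared with Age, Cement, and Water.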

Concrete Strength

Ndubuisi Chibuogwu

2022-03-11