To determine which mix components significantly affect concrete strength, we examine the data set and rename columns for clarity.
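# Packages used below (assumed installed): MASS for boxcox(), car for avPlots(),
# caret for preProcess(), tidyverse for %>%, pivot_longer() and ggplot()
library(MASS)
library(car)
library(caret)
library(tidyverse)
concrete <- read.csv("C:\\Users\\workd\\Desktop\\concrete.csv")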
head(concrete)
## rownames cement blast_furnace_slag fly_ash water superplasticizer
## 1 1 540.0 0.0 0 162 2.5
## 2 2 540.0 0.0 0 162 2.5
## 3 3 332.5 142.5 0 228 0.0
## 4 4 332.5 142.5 0 228 0.0
## 5 5 198.6 132.4 0 192 0.0
## 6 6 266.0 114.0 0 228 0.0
## coarse_aggregate fine_aggregate age compressive_strength
## 1 1040.0 676.0 28 79.99
## 2 1055.0 676.0 28 61.89
## 3 932.0 594.0 270 40.27
## 4 932.0 594.0 365 41.05
## 5 978.4 825.5 360 44.30
## 6 932.0 670.0 90 47.03
colnames(concrete)
## [1] "rownames" "cement" "blast_furnace_slag"
## [4] "fly_ash" "water" "superplasticizer"
## [7] "coarse_aggregate" "fine_aggregate" "age"
## [10] "compressive_strength"
# Drop the row-index column (1) and fly_ash (4)
concrete <- concrete[-c(1, 4)]
# Rename columns for clarity
concrete <- setNames(concrete, c("Cement", "BFS", "Water", "Superplast", "CoarseAggr", "FineAggr", "Age", "Strength"))
colnames(concrete)
## [1] "Cement" "BFS" "Water" "Superplast" "CoarseAggr"
## [6] "FineAggr" "Age" "Strength"
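Before identifying the key predictors influencing strength, we rescale every variable to the range 0 to 1 (min-max normalization) so that the regression coefficients are comparable across predictors.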
concrete_trxf <- preProcess(as.data.frame(concrete), method = c("range"))
concrete <- predict(concrete_trxf, as.data.frame(concrete))
concrete_round <- round(concrete, 3)
head(concrete_round)
## Cement BFS Water Superplast CoarseAggr FineAggr Age Strength
## 1 1.000 0.000 0.321 0.078 0.695 0.206 0.074 0.967
## 2 1.000 0.000 0.321 0.078 0.738 0.206 0.074 0.742
## 3 0.526 0.396 0.848 0.000 0.381 0.000 0.739 0.473
## 4 0.526 0.396 0.848 0.000 0.381 0.000 1.000 0.482
## 5 0.221 0.368 0.561 0.000 0.516 0.581 0.986 0.523
## 6 0.374 0.317 0.848 0.000 0.381 0.191 0.245 0.557
attach(concrete)
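The `range` method rescales each column to a common interval. As a minimal sketch of the arithmetic, assuming the target interval is 0 to 1, the same calculation can be reproduced in base R on a small made-up vector (the values and the helper name `rescale01` are illustrative, not taken from the data set):

```r
# Hypothetical helper: min-max scaling of a numeric vector to [0, 1]
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up cement contents, for illustration only
cement_raw <- c(200, 300, 400, 600)
cement_scaled <- rescale01(cement_raw)
print(cement_scaled)
```

The smallest value maps to 0 and the largest to 1, which matches the pattern in the rescaled table above.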
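To quantify relationships between predictors and concrete strength, we use regression models. The two summaries below show Strength on the scaled 0 to 1 range and then on a 1 to 2 range, i.e. after adding 1 (the code for this step is not shown).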
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2664 0.4001 0.4172 0.5457 1.0000
## [1] 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.266 1.400 1.417 1.546 2.000
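A shift like the one visible in the second summary keeps the response strictly positive, which `boxcox()` requires. A minimal sketch of such a step, assuming a constant of 1 and using made-up values rather than the actual data:

```r
# Made-up scaled strengths in [0, 1], for illustration only
strength_scaled <- c(0, 0.25, 0.4, 1)

# Box-Cox needs a strictly positive response, so shift by a constant.
# The constant 1 is an assumption inferred from the summaries above.
strength_shifted <- strength_scaled + 1

print(range(strength_shifted))
print(all(strength_shifted > 0))
```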
model <- lm(Strength ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model)
##
## Call:
## lm(formula = Strength ~ Cement + BFS + Water + Superplast + CoarseAggr +
## FineAggr + Age, data = concrete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38497 -0.09019 0.00549 0.08595 0.42865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.52310 0.06136 24.824 < 2e-16 ***
## Cement 0.36906 0.02256 16.357 < 2e-16 ***
## BFS 0.19051 0.02324 8.196 7.39e-16 ***
## Water -0.50421 0.05044 -9.997 < 2e-16 ***
## Superplast 0.14908 0.03806 3.917 9.56e-05 ***
## CoarseAggr -0.11786 0.02963 -3.978 7.44e-05 ***
## FineAggr -0.19142 0.03365 -5.688 1.68e-08 ***
## Age 0.49766 0.02500 19.903 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1325 on 1022 degrees of freedom
## Multiple R-squared: 0.5971, Adjusted R-squared: 0.5944
## F-statistic: 216.4 on 7 and 1022 DF, p-value: < 2.2e-16
boxcox_result <- boxcox(model)
# Extract the optimal lambda
optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
optimal_lambda
## [1] 0.1010101
The grid search puts the peak of the curve at lambda ≈ 0.10, which lies between a log transformation (lambda = 0) and a square-root transformation (lambda = 0.5). We fit the model with the estimated lambda first and then, for comparison, with the simpler square-root transformation of the response variable (Strength).
The 95% confidence interval for lambda does not include lambda = 1. This indicates that a transformation is likely necessary.
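The interval that `boxcox()` draws on its plot can also be read off numerically. The sketch below runs on simulated data rather than the concrete data set and keeps every lambda whose profile log-likelihood lies within `qchisq(0.95, 1) / 2` of the maximum. The variable names and the simulated model are illustrative assumptions.

```r
library(MASS)

set.seed(1)
# Simulated data: a log-linear response, so the estimated lambda should be near 0
x <- runif(200)
y <- exp(1 + 2 * x + rnorm(200, sd = 0.3))
fit <- lm(y ~ x)

# Profile log-likelihood on a fine grid, without drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

lambda_hat <- bc$x[which.max(bc$y)]
# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the peak
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[in_ci])

print(lambda_hat)
print(lambda_ci)
# A transformation is indicated when the interval excludes 1
print(lambda_ci[2] < 1)
```

Applied to the fitted model here, the same steps would use `boxcox_result$x` and `boxcox_result$y` directly.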
if (abs(optimal_lambda) < 1e-6) {
# If lambda is close to 0, use the log transformation
concrete$Strength_transformed <- log(concrete$Strength)
} else {
# Otherwise, use the Box-Cox transformation formula
concrete$Strength_transformed <- (concrete$Strength^optimal_lambda - 1) / optimal_lambda
}
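The branch above is the usual one-parameter Box-Cox transform. As a sketch, it can be wrapped in a small helper (the name `box_cox` and the tolerance are illustrative assumptions) to show that the power formula approaches the log as lambda goes to 0, and that lambda = 0.5 is a linear rescaling of the square root:

```r
# Hypothetical helper implementing the one-parameter Box-Cox transform
box_cox <- function(y, lambda, tol = 1e-6) {
  stopifnot(all(y > 0))            # defined for positive values only
  if (abs(lambda) < tol) {
    log(y)                         # limiting case lambda -> 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(1, 2, 4)                    # made-up positive values

# lambda = 0.5 equals 2 * (sqrt(y) - 1), a linear rescaling of sqrt(y)
print(box_cox(y, 0.5))
# A small non-zero lambda is already very close to log(y)
print(max(abs(box_cox(y, 1e-4) - log(y))))
```

Because the lambda = 0.5 case differs from `sqrt(y)` only by a linear rescaling, a regression on either version gives the same t-statistics and R-squared.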
model_transformed <- lm(Strength_transformed ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model_transformed)
##
## Call:
## lm(formula = Strength_transformed ~ Cement + BFS + Water + Superplast +
## CoarseAggr + FineAggr + Age, data = concrete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.305074 -0.072005 0.007782 0.068788 0.269325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43628 0.04465 9.770 < 2e-16 ***
## Cement 0.25799 0.01642 15.710 < 2e-16 ***
## BFS 0.12622 0.01692 7.461 1.83e-13 ***
## Water -0.36427 0.03671 -9.924 < 2e-16 ***
## Superplast 0.10917 0.02770 3.941 8.65e-05 ***
## CoarseAggr -0.09318 0.02156 -4.322 1.70e-05 ***
## FineAggr -0.15173 0.02449 -6.195 8.43e-10 ***
## Age 0.36696 0.01820 20.165 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09647 on 1022 degrees of freedom
## Multiple R-squared: 0.5931, Adjusted R-squared: 0.5903
## F-statistic: 212.8 on 7 and 1022 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_transformed)
Scale-Location: The red line (smoothed curve) is roughly horizontal and the spread of residuals is consistent across the range of fitted values, so the assumption of constant variance appears to be satisfied.
Residuals vs Fitted: The residuals are randomly scattered around zero with no clear pattern (e.g., a curve or funnel shape) that would suggest non-linearity or heteroscedasticity, so the model assumptions are likely satisfied. Possible outliers or influential points are also visible in this plot (e.g., the points labeled 225 and 382).
Normal Q-Q Plot: The points fall approximately along the diagonal line, so the residuals are approximately normally distributed. Deviations at the tails (the ends of the line) point to extreme values or outliers.
# Alternative: square-root transformation of the response (lambda = 0.5)
concrete$Strength_transformed <- sqrt(concrete$Strength)
# Re-fit the model with the transformed response variable
model_transformed <- lm(Strength_transformed ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age, data = concrete)
summary(model_transformed)
##
## Call:
## lm(formula = Strength_transformed ~ Cement + BFS + Water + Superplast +
## CoarseAggr + FineAggr + Age, data = concrete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168821 -0.039092 0.003973 0.038407 0.165222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.236155 0.025619 48.252 < 2e-16 ***
## Cement 0.150964 0.009421 16.024 < 2e-16 ***
## BFS 0.075685 0.009706 7.798 1.54e-14 ***
## Water -0.210031 0.021059 -9.973 < 2e-16 ***
## Superplast 0.062577 0.015891 3.938 8.78e-05 ***
## CoarseAggr -0.051694 0.012371 -4.179 3.18e-05 ***
## FineAggr -0.084050 0.014052 -5.982 3.05e-09 ***
## Age 0.209741 0.010440 20.090 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05535 on 1022 degrees of freedom
## Multiple R-squared: 0.5957, Adjusted R-squared: 0.5929
## F-statistic: 215.1 on 7 and 1022 DF, p-value: < 2.2e-16
# Get influence measures
influence_results <- influence.measures(model_transformed)
# Extract the first 8 rows of the influence measures
head(influence_results$infmat, 8)
## dfb.1_ dfb.Cmnt dfb.BFS dfb.Watr dfb.Sprp
## 1 0.108140044 0.072724169 -0.050473986 -0.135966477 -0.1304224376
## 2 0.027427815 0.025123043 -0.011933598 -0.036899116 -0.0357200404
## 3 -0.010394101 0.005555136 -0.006982355 -0.009496709 -0.0007360403
## 4 -0.022258157 0.020373200 -0.013560067 0.006423749 0.0013808878
## 5 0.008963693 0.015545514 -0.048767433 0.025128589 0.0278880678
## 6 -0.022853166 -0.001436967 0.004476184 0.063320585 0.0217074181
## 7 -0.026247250 0.004485714 0.019948065 0.007220629 0.0033894993
## 8 0.001237080 0.006083265 -0.004996015 0.011557347 0.0009209081
## dfb.CrsA dfb.FnAg dfb.Age dffit cov.r cook.d
## 1 -0.058242394 -0.107926386 -0.020505709 0.24361271 0.9863850 0.0073934384
## 2 -0.011390874 -0.028740157 -0.006523911 0.07266273 1.0175784 0.0006603655
## 3 0.022160057 0.052344646 -0.156161141 -0.21341412 1.0041628 0.0056840286
## 4 0.041451764 0.084711518 -0.376964176 -0.44231843 0.9853367 0.0243219021
## 5 -0.006559903 -0.042653586 -0.231612646 -0.24883226 1.0230871 0.0077321735
## 6 0.003031744 -0.008769008 0.005734749 0.13127080 0.9912484 0.0021500366
## 7 0.047942833 0.087562618 -0.352449942 -0.41884145 0.9903048 0.0218225757
## 8 -0.004795988 -0.014650135 -0.014206739 0.04252859 1.0158985 0.0002262643
## hat
## 1 0.013161891
## 2 0.012722228
## 3 0.016904838
## 4 0.028696448
## 5 0.030130833
## 6 0.005936782
## 7 0.028585244
## 8 0.009392049
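The raw influence table is easier to use once observations are screened against cutoffs. The sketch below uses simulated data and two common rules of thumb, Cook's distance above 4/n and leverage above 2p/n. Both cutoffs are conventions chosen here for illustration, not values taken from the analysis above.

```r
set.seed(42)
# Simulated data with one deliberately extreme observation
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 6      # far from the other x values: high leverage
y[1] <- -10    # far from the regression line: large residual
fit <- lm(y ~ x)

p <- length(coef(fit))            # number of estimated coefficients
cook <- cooks.distance(fit)
lev <- hatvalues(fit)

# Rule-of-thumb cutoffs (assumptions, not universal standards)
high_cook <- which(cook > 4 / n)
high_lev <- which(lev > 2 * p / n)

print(high_cook)
print(high_lev)
```

For the model in this report, the analogous calls would be `cooks.distance(model_transformed)` and `hatvalues(model_transformed)` with `n = nrow(concrete)`.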
pairs(Strength ~ Cement + BFS + Water + Superplast + CoarseAggr + FineAggr + Age)
## Histogram Distribution of Predictors
df <- concrete
df %>%
pivot_longer(cols = colnames(df)) %>%
ggplot() +
geom_histogram(aes(value), bins = 25) +
facet_wrap(~name, scales = 'free') +
theme_minimal() +
theme(axis.title.y = element_blank())
1. Skewness of Variables
Variables with skewed distributions: Age, Cement, Strength, Superplast.
Interpretation: Age: The right-skewed distribution suggests that most concrete samples are tested at younger ages (e.g., 0–100 days), with fewer samples at older ages. This implies that older concrete samples are underrepresented, which could bias the model if age has a non-linear relationship with strength.
Cement: The left-skewed distribution indicates that most concrete mixes have higher cement content. The histogram alone does not show how cement affects strength; its effect may be non-linear or plateau at higher levels.
Strength: The right-skewed distribution of the original Strength variable indicates that most concrete samples have lower compressive strength, with fewer samples achieving high strength. Linear regression assumes normally distributed residuals rather than a normally distributed response, but a skewed response often leads to skewed residuals, which is why the transformation (Strength_transformed) was applied.
Superplast: The right-skewed distribution suggests that low levels of superplasticizer are more common in the dataset. Superplasticizer is used to improve workability, and its effect on strength may be more pronounced at higher levels.
Variables with bimodal distributions: BFS, Water.
Interpretation: BFS (Blast Furnace Slag): The bimodal distribution suggests that there are two distinct groups of concrete mixes: one with little to no BFS and another with significant amounts of BFS. This could indicate different types of concrete, which may have different strength characteristics.
Water: The bimodal distribution indicates that there are two distinct water content levels in the dataset. This could reflect different mix designs (e.g., high-water vs. low-water mixes), which can significantly affect strength. Higher water content generally reduces strength due to increased porosity, while lower water content improves strength but may reduce workability.
Interpretation: The transformation makes the relationship between Strength and the predictors more linear, which is essential for linear regression models.
This suggests that the original relationship between predictors and strength may have been non-linear, and the transformation helps capture this relationship more accurately.
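# Scatter plots to visualize relationships between predictors and strength
# (Strength_transformed is excluded so that it is not plotted as a predictor)
concrete %>%
pivot_longer(cols = -c(Strength, Strength_transformed), names_to = "predictor", values_to = "value") %>%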
ggplot(aes(x = value, y = Strength)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
facet_wrap(~predictor, scales = "free") +
theme_minimal() +
labs(title = "Scatter Plots of Predictors vs. Strength", x = "Predictor Value", y = "Strength")
## `geom_smooth()` using formula = 'y ~ x'
## Histograms of All Variables
# Load necessary libraries
library(tidyverse)
# Create histograms for all variables
concrete %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(bins = 25, fill = "blue", color = "black") +
facet_wrap(~name, scales = "free") +
theme_minimal() +
labs(title = "Histograms of Concrete Dataset Variables", x = "Value", y = "Frequency")
Based on the regression results and the distributions above, the following variables are likely to have the greatest effect on concrete strength:
Cement: Cement content is left-skewed, so most mixes are cement-rich. Higher cement content generally leads to higher strength (coefficient 0.151 in the square-root model), but the effect may plateau at very high levels.
Water: Water content is bimodal. Lower water content typically results in higher strength, while higher water content reduces strength (coefficient -0.210).
Age: Age is right-skewed, so older concrete samples are underrepresented, but age has a strong positive effect on strength (coefficient 0.210) because concrete gains strength over time.
BFS: BFS is bimodal, which suggests that slag-based concrete (with higher BFS content) may have different strength characteristics from ordinary Portland cement concrete (coefficient 0.076).
pairs(df)
par(mfrow = c(2, 4))
rstand <- rstandard(model_transformed)
plot(rstand ~ Cement)
plot(rstand ~ BFS)
plot(rstand ~ Water)
plot(rstand ~ Superplast)
plot(rstand ~ CoarseAggr)
plot(rstand ~ FineAggr)
plot(rstand ~ Age)
plot(rstand ~ model_transformed$fitted.values)
## Added Variable Plot for Each Variable
par(mfrow = c(1, 3))
avPlots(model_transformed)
Age: The scatter plot likely shows a positive trend, where strength increases with age. This is expected because concrete gains strength over time as it cures. Interpretation: Older concrete samples tend to have higher strength, but the relationship may plateau at very high ages.
Cement: The scatter plot likely shows a positive trend, where strength increases with cement content. This is expected because cement is a primary binding agent in concrete. Interpretation: Higher cement content generally leads to higher strength, but the effect may plateau at very high cement levels.
Coarse aggregate: The scatter plot may show a weak or flat trend. Interpretation: Coarse aggregate primarily provides bulk and stability to the concrete mix. Its effect on strength is less pronounced than that of cement or water, although its regression coefficient is negative and statistically significant (-0.052, p < 0.001 in the square-root model).
Fine aggregate: The scatter plot may show a weak or flat trend, similar to coarse aggregate. Interpretation: Fine aggregate fills voids and improves workability. Its direct effect on strength is small, although its regression coefficient is also negative and statistically significant (-0.084, p < 0.001).
Superplasticizer: The scatter plot may show a positive trend, where strength increases with superplasticizer content. Interpretation: Superplasticizer improves workability and can enhance strength by allowing for a lower water-cement ratio. However, its effect may be more pronounced at moderate levels.
Water: The scatter plot likely shows a negative trend, where strength decreases with higher water content. Interpretation: Excess water increases porosity and reduces the density of concrete, leading to lower strength. A lower water-cement ratio generally results in higher strength.
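Strongest Positive Effects: Age and Cement have the strongest positive relationships with strength (coefficients 0.210 and 0.151 in the square-root model).
Strongest Negative Effect: Water has a strong negative relationship with strength (coefficient -0.210).
Moderate Effects: BFS and Superplast show moderate positive effects (0.076 and 0.063), with potential non-linear or bimodal relationships.
Weak Effects: CoarseAggr and FineAggr have small but statistically significant negative effects on strength (-0.052 and -0.084). FineAggr is comparable in magnitude to BFS.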
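Because all predictors were rescaled to the same 0 to 1 range, coefficient magnitudes can be compared directly. A minimal sketch of such a ranking on simulated data (the variable names and true effects are made up for illustration):

```r
set.seed(7)
n <- 500
# Simulated predictors already on a common [0, 1] scale
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + 0.05 * dat$x3 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Drop the intercept and order predictors by absolute coefficient size
est <- coef(fit)[-1]
ranking <- est[order(abs(est), decreasing = TRUE)]
print(round(ranking, 3))
```

For the model in this report, `coef(model_transformed)[-1]` would be the analogous input.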