M Numbers: “M10837250 | M10625565 | M10804977 | M10629718”

Date: “December 3, 2016”

Report Summary

Strength = \(\beta_0\) + \(\beta_1\) * cement + \(\beta_2\) * blast + \(\beta_3\) * fly + \(\beta_4\) * water + \(\beta_5\) * superplast + \(\beta_6\) * coarse + \(\beta_7\) * fine + \(\beta_8\) * age

Our Final Model is

Strength = \(\beta_0\) + \(\beta_1\) * cement + \(\beta_2\) * blast + \(\beta_3\) * fly + \(\beta_4\) * water + \(\beta_5\) * age

  • While checking for patterns in residual plot we identified some parabolic pattern. Thus it was not the best fit model to predict the Compressive strength of concrete.

  • To select the suitable transformation we plotted residuals against each regressors. The pattern was highly visible when age is plotted against residual. Thus we applied inverse,squared and logarithmic transform on age and obtained the adjusted R2 value as 81% for logrithimic transform, so we continued with log transform.

  • Continuing with tranformed model we again plot the residual plot and observed no patterns.

  • So, we proceed to check for multi-collinearity and outliers and found none.

  • Predicting the confidence interval for all the co-efficients and also obtaining the predicted values and also obtained the predicted values based on our linear model

Let us begin this by viewing a quick introduction video on concrete

Chapter 1: Data Exploring

1) Data Exploration and Cleaning

The dataset we have is the Concrete dataset from the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) contributed by I-Cheng Yeh.

This data is of 1030 samples of concrete with 8 variables describing the inputs that go into making concrete. A brief explanation of these variables is as follows:-

Name Description Range Measurement
Cement Input Variable (102.0 - 540.0) kg in a m3 mixture
Blast Furnace Slag Input Variable (0.0 - 359.4) kg in a m3 mixture
Fly Ash Input Variable (0.0 - 200.10) kg in a m3 mixture
Water Input Variable (121.8 - 247.0) kg in a m3 mixture
Superplasticizer Input Variable (0.0 - 32.200) kg in a m3 mixture
Coarse Aggregate Input Variable (801.0 - 1145.0) kg in a m3 mixture
Fine Aggregate Input Variable (594.0 - 992.6) kg in a m3 mixture
Age Input Variable (1.0 - 365.0) Days
Concrete compressive strength Output Variable (2.33 - 82.60) MPa
  • Superplasticizers, also known as high range water reducers, are chemical admixtures used where well-dispersed particle suspension is required. These polymers are used as dispersants to avoid particle segregation (gravel, coarse and fine sands), and to improve the flow characteristics (rheology) of suspensions such as in concrete applications. Their addition to concrete or mortar allows the reduction of the water to cement ratio, not affecting the workabilityof the mixture, and enables the production of self-consolidating concrete and high performance concrete.

  • Fly ash, also known as “pulverised fuel ash”, is one of the coal combustion products, composed of the fine particles that are driven out of the boiler with the flue gases.Fly ash is used as a supplementary cementitious material (SCM) in the production of portland cement concrete. A supplementary cementitious material, when used in conjunction with portland cement, contributes to the properties of the hardened concrete through hydraulic or pozzolanic activity, or both

  • Blast Furnace Slag is formed when iron ore or iron pellets, coke and a flux (either limestone or dolomite) are melted together in a blast furnace. Its measurable benefits in concrete include improved workability and finishability, high compressive and flexural strengths, and resistance to aggressive chemicals.

##      Cement      Blast.furnace.slags   Fly.ashes          Water      
##  Min.   :102.0   Min.   :  0.0       Min.   :  0.00   Min.   :121.8  
##  1st Qu.:192.4   1st Qu.:  0.0       1st Qu.:  0.00   1st Qu.:164.9  
##  Median :272.9   Median : 22.0       Median :  0.00   Median :185.0  
##  Mean   :281.2   Mean   : 73.9       Mean   : 54.19   Mean   :181.6  
##  3rd Qu.:350.0   3rd Qu.:142.9       3rd Qu.:118.30   3rd Qu.:192.0  
##  Max.   :540.0   Max.   :359.4       Max.   :200.10   Max.   :247.0  
##  Superplasticizers Coarse.aggregates Fine.aggregates      Age        
##  Min.   : 0.000    Min.   : 801.0    Min.   :594.0   Min.   :  1.00  
##  1st Qu.: 0.000    1st Qu.: 932.0    1st Qu.:731.0   1st Qu.:  7.00  
##  Median : 6.400    Median : 968.0    Median :779.5   Median : 28.00  
##  Mean   : 6.205    Mean   : 972.9    Mean   :773.6   Mean   : 45.66  
##  3rd Qu.:10.200    3rd Qu.:1029.4    3rd Qu.:824.0   3rd Qu.: 56.00  
##  Max.   :32.200    Max.   :1145.0    Max.   :992.6   Max.   :365.00  
##  Compressive.strength
##  Min.   : 2.33       
##  1st Qu.:23.71       
##  Median :34.45       
##  Mean   :35.82       
##  3rd Qu.:46.13       
##  Max.   :82.60

Chapter 2: Data Visualization

## Loading required package: ggplot2
## Warning in arrange_impl(.data, dots): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
## Warning: Numeric color variables cannot (yet) be mapped to lines.
##  when the trace type is 'scatter' or 'scattergl'.
## Warning: Numeric color variables cannot (yet) be mapped to lines.
##  when the trace type is 'scatter' or 'scattergl'.

As we can see here, the compressive strength of cement increases rapidly initially with age as the cement dries. At around day 60, it tends to stabilize and from then it it stays the same till the end of the observation period.

attach(concrete)
library(plotly)
library(reshape2)

model <- lm(Compressive.strength ~ Age + Water)

#Graph Resolution
graph_reso <- 0.5

#Setup Axis
axis_x <- seq(min(Age), max(Age), by = graph_reso)
axis_y <- seq(min(Water), max(Water), by = graph_reso)

#Sample points
lm_surface <- expand.grid(Age = axis_x,Water = axis_y,KEEP.OUT.ATTRS = F)
lm_surface$Compressive.strength <- predict.lm(model, newdata = lm_surface)
lm_surface <- acast(lm_surface, Water ~ Age, value.var = "Compressive.strength") #y ~ x


c <- plot_ly(concrete,
        x =~ Age,
        y =~ Water,
        z =~ Compressive.strength, 
        type= "scatter3d",
        mode = "markers",
        color =  Compressive.strength
        )
c <- add_trace(c, 
               z = lm_surface,
               x = axis_x,
               y = axis_y, 
               type = "surface")
c
## Warning: 'surface' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'hoverinfo', 'stream', 'z', 'x', 'y', 'text', 'surfacecolor', 'cauto', 'cmin', 'cmax', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'contours', 'hidesurface', 'lightposition', 'lighting', 'colorbar', '_deprecated', 'scene', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'surfacecolorsrc', 'key'
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

As expected, The strength of concrete increase with addition of cement. Addition of water exhibits a non linear relationship with the strength of concrete.

## Warning: 'surface' objects don't have these attributes: 'mode'
## Valid attributes include:
## 'type', 'visible', 'showlegend', 'legendgroup', 'opacity', 'name', 'uid', 'hoverinfo', 'stream', 'z', 'x', 'y', 'text', 'surfacecolor', 'cauto', 'cmin', 'cmax', 'colorscale', 'autocolorscale', 'reversescale', 'showscale', 'contours', 'hidesurface', 'lightposition', 'lighting', 'colorbar', '_deprecated', 'scene', 'zsrc', 'xsrc', 'ysrc', 'textsrc', 'surfacecolorsrc', 'key'

Water and Superplasticizers show a clear inverse relationship. The time effect of water on the strength of concrete is offset by the use of plasticizers.

Observation

Our Model shows that the theoretical values of 0.4 to 0.6 for water to cement ratio gives the highest values of compressive strength. When the ratio of water increases over 0.6 results in drastically reduced compressive strength.

Chapter 3: Data Modelling

3) Goal: Building a multiple linear model and also checking the significance of variables when combined together

Used the methods of forward selection stepwise regressionand anova comparison of models significance

R Code:

summary(lm(strength ~ cement))$adj.r.squared;
## [1] 0.2471049
summary(lm(strength ~ cement + blast))$adj.r.squared;
## [1] 0.3264808
summary(lm(strength ~ cement + blast + fly))$adj.r.squared;
## [1] 0.4013944
summary(lm(strength ~ cement + blast + fly + water))$adj.r.squared;
## [1] 0.4408639
summary(lm(strength ~ cement + blast + fly + water + superplast))$adj.r.squared;
## [1] 0.4451274
summary(lm(strength ~ cement + blast + fly + water + superplast + coarse))$adj.r.squared;
## [1] 0.4454807
summary(lm(strength ~ cement + blast + fly + water + superplast + coarse + fine))$adj.r.squared;
## [1] 0.4449406
summary(lm(strength ~ cement + blast + fly + water + superplast + coarse + fine + age))$adj.r.squared;
## [1] 0.6125073
full_model <- lm(strength ~ cement + blast + fly + water + superplast + coarse + fine + age);
cement_red_model <- lm(strength ~ blast + fly + water + superplast + coarse + fine + age);
anova(cement_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ blast + fly + water + superplast + coarse + fine + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1022 131952                                  
## 2   1021 110413  1     21539 199.17 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
blast_red_model <- lm(strength ~ cement + fly + water + superplast + coarse + fine + age);
anova(blast_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + fly + water + superplast + coarse + fine + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1022 121769                                  
## 2   1021 110413  1     11356 105.01 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fly_red_model <- lm(strength ~ cement + blast + water + superplast + coarse + fine + age);
anova(fly_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + water + superplast + coarse + fine + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1   1022 115694                                 
## 2   1021 110413  1    5281.1 48.834 5.02e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
water_red_model <- lm(strength ~ cement + blast + fly + superplast + coarse + fine + age);
anova(water_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + superplast + coarse + fine + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1022 111919                                  
## 2   1021 110413  1    1505.7 13.924 0.0002009 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
super_red_model <- lm(strength ~ cement + blast + fly + water + coarse + fine + age);
anova(super_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + water + coarse + fine + age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq     F  Pr(>F)   
## 1   1022 111471                              
## 2   1021 110413  1    1058.1 9.784 0.00181 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coarse_red_model <- lm(strength ~ cement + blast + fly + water + superplast + fine + age);
anova(coarse_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + water + superplast + fine + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1022 110814                              
## 2   1021 110413  1    401.01 3.7082 0.05442 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fine_red_model <- lm(strength ~ cement + blast + fly + water + superplast + coarse + age);
anova(fine_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + water + superplast + coarse + 
##     age
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   1022 110798                              
## 2   1021 110413  1    384.93 3.5595 0.05949 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
age_red_model <- lm(strength ~ cement + blast + fly + water + superplast + coarse + fine);
anova(age_red_model, full_model);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine
## Model 2: strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   1022 158315                                  
## 2   1021 110413  1     47902 442.95 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation:

When the fine variable is added, the adjusted R2 is decreased. Also confirmed using Anova, that the fine & coarse variable is not significant.

For the fine & coarse variables, the p-value is > alpha(0.05). So we are failed to reject the null hypothesis. Finally we conclude that they are not significant and can be removed from the model. Let’s rebuild the model and test its significance

3.1) Goal: Confirm the significance of each variable also with the stepAIC backward selection

#Multiple linear model with all the variables
full_model_aic <- lm(strength ~ cement + blast + fly + water + superplast + coarse + fine + age);
##strength = $\beta_0$ + $\beta_1$ * cement + $\beta_2$ * blast + $\beta_3$ * fly + $\beta_4$ * water + $\beta_5$ * superplast + $\beta_6$ * coarse + $\beta_7$ * fine + $\beta_8$ * age
#install.packages("MASS");
library(MASS);
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cement
## The following object is masked from 'package:plotly':
## 
##     select
step <- stepAIC(full_model_aic, direction="backward");
## Start:  AIC=4832.91
## strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    110413 4832.9
## - fine        1       385 110798 4834.5
## - coarse      1       401 110814 4834.6
## - superplast  1      1058 111471 4840.7
## - water       1      1506 111919 4844.9
## - fly         1      5281 115694 4879.0
## - blast       1     11356 121769 4931.7
## - cement      1     21539 131952 5014.5
## - age         1     47902 158315 5202.1
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
## 
## Final Model:
## strength ~ cement + blast + fly + water + superplast + coarse + 
##     fine + age
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       1021   110413.2 4832.911
anova(full_model_aic);
## Analysis of Variance Table
## 
## Response: strength
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## cement        1  71173   71173 658.1385 < 2.2e-16 ***
## blast         1  22961   22961 212.3185 < 2.2e-16 ***
## fly           1  21639   21639 200.0939 < 2.2e-16 ***
## water         1  11458   11458 105.9500 < 2.2e-16 ***
## superplast    1   1374    1374  12.7097 0.0003806 ***
## coarse        1    256     256   2.3646 0.1244260    
## fine          1      1       1   0.0066 0.9353645    
## age           1  47902   47902 442.9521 < 2.2e-16 ***
## Residuals  1021 110413     108                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(full_model_aic)$adj.r.squared; #61.25
## [1] 0.6125073
full_model_aicnofine <- lm(strength ~ cement + blast + fly + water + superplast + coarse + age);
step <- stepAIC(full_model_aicnofine, direction = "backward");
## Start:  AIC=4834.5
## strength ~ cement + blast + fly + water + superplast + coarse + 
##     age
## 
##              Df Sum of Sq    RSS    AIC
## - coarse      1        45 110843 4832.9
## <none>                    110798 4834.5
## - superplast  1       887 111685 4840.7
## - water       1      8504 119303 4908.7
## - fly         1      8559 119357 4909.1
## - blast       1     30689 141487 5084.3
## - age         1     47518 158316 5200.1
## - cement      1     64017 174815 5302.2
## 
## Step:  AIC=4832.91
## strength ~ cement + blast + fly + water + superplast + age
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    110843 4832.9
## - superplast  1       875 111718 4839.0
## - fly         1      8537 119381 4907.3
## - water       1     11543 122386 4933.0
## - blast       1     32750 143593 5097.5
## - age         1     47728 158571 5199.7
## - cement      1     66774 177617 5316.6
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## strength ~ cement + blast + fly + water + superplast + coarse + 
##     age
## 
## Final Model:
## strength ~ cement + blast + fly + water + superplast + age
## 
## 
##       Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                           1022   110798.1 4834.495
## 2 - coarse  1 45.12133      1023   110843.2 4832.915
anova(full_model_aicnofine);
## Analysis of Variance Table
## 
## Response: strength
##              Df Sum Sq Mean Sq  F value    Pr(>F)    
## cement        1  71173   71173 656.4944 < 2.2e-16 ***
## blast         1  22961   22961 211.7881 < 2.2e-16 ***
## fly           1  21639   21639 199.5941 < 2.2e-16 ***
## water         1  11458   11458 105.6853 < 2.2e-16 ***
## superplast    1   1374    1374  12.6779  0.000387 ***
## coarse        1    256     256   2.3587  0.124897    
## age           1  47518   47518 438.3015 < 2.2e-16 ***
## Residuals  1022 110798     108                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(full_model_aicnofine)$adj.r.squared; #61.15
## [1] 0.6115369
full_model_aicnofinenocoarse <- lm(strength ~ cement + blast + fly + water + superplast + age);
step <- stepAIC(full_model_aicnofinenocoarse, direction = "backward");
## Start:  AIC=4832.91
## strength ~ cement + blast + fly + water + superplast + age
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    110843 4832.9
## - superplast  1       875 111718 4839.0
## - fly         1      8537 119381 4907.3
## - water       1     11543 122386 4933.0
## - blast       1     32750 143593 5097.5
## - age         1     47728 158571 5199.7
## - cement      1     66774 177617 5316.6
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## strength ~ cement + blast + fly + water + superplast + age
## 
## Final Model:
## strength ~ cement + blast + fly + water + superplast + age
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       1023   110843.2 4832.915
anova(full_model_aicnofinenocoarse);
## Analysis of Variance Table
## 
## Response: strength
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## cement        1  71173   71173 656.869 < 2.2e-16 ***
## blast         1  22961   22961 211.909 < 2.2e-16 ***
## fly           1  21639   21639 199.708 < 2.2e-16 ***
## water         1  11458   11458 105.746 < 2.2e-16 ***
## superplast    1   1374    1374  12.685 0.0003855 ***
## age           1  47728   47728 440.495 < 2.2e-16 ***
## Residuals  1023 110843     108                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(full_model_aicnofinenocoarse)$adj.r.squared; #61.18
## [1] 0.6117586
full_model_aicnofinenocoarsenosuperplast <- lm(strength ~ cement + blast + fly + water + age);
step <- stepAIC(full_model_aicnofinenocoarsenosuperplast, direction = "backward");
## Start:  AIC=4839.01
## strength ~ cement + blast + fly + water + age
## 
##          Df Sum of Sq    RSS    AIC
## <none>                111718 4839.0
## - fly     1     15285 127003 4969.1
## - water   1     25350 137068 5047.6
## - blast   1     45229 156947 5187.1
## - age     1     48228 159946 5206.6
## - cement  1     85377 197095 5421.8
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## strength ~ cement + blast + fly + water + age
## 
## Final Model:
## strength ~ cement + blast + fly + water + age
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       1024   111718.1 4839.013
anova(full_model_aicnofinenocoarsenosuperplast);
## Analysis of Variance Table
## 
## Response: strength
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## cement       1  71173   71173  652.36 < 2.2e-16 ***
## blast        1  22961   22961  210.46 < 2.2e-16 ***
## fly          1  21639   21639  198.34 < 2.2e-16 ***
## water        1  11458   11458  105.02 < 2.2e-16 ***
## age          1  48228   48228  442.05 < 2.2e-16 ***
## Residuals 1024 111718     109                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(full_model_aicnofinenocoarsenosuperplast)$adj.r.squared; #60.91
## [1] 0.6090761
full_model_aicnofinenocoarsenosuperplastnowater <- lm(strength ~ cement + blast + fly + age);
step <- stepAIC(full_model_aicnofinenocoarsenosuperplastnowater, direction = "backward");
## Start:  AIC=5047.65
## strength ~ cement + blast + fly + age
## 
##          Df Sum of Sq    RSS    AIC
## <none>                137068 5047.6
## - fly     1     30893 167961 5255.0
## - age     1     34335 171403 5275.9
## - blast   1     49302 186370 5362.1
## - cement  1    112616 249684 5663.4
step$anova;
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## strength ~ cement + blast + fly + age
## 
## Final Model:
## strength ~ cement + blast + fly + age
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       1025   137068.3 5047.648
anova(full_model_aicnofinenocoarsenosuperplastnowater);
## Analysis of Variance Table
## 
## Response: strength
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## cement       1  71173   71173  532.23 < 2.2e-16 ***
## blast        1  22961   22961  171.70 < 2.2e-16 ***
## fly          1  21639   21639  161.81 < 2.2e-16 ***
## age          1  34335   34335  256.76 < 2.2e-16 ***
## Residuals 1025 137068     134                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(full_model_aicnofinenocoarsenosuperplastnowater)$adj.r.squared; #52.08
## [1] 0.5208389

Observation: We need to look for the small stepAIC and maximum Adjusted R square to get the best fit model with significant variables. Fromm the above when we remove the fine aggregate, coarse aggregate and superplast, there are small variations, but yet we maintain the minimum stepAIC and max Adj R2. But when we remove the water, the adjusted R2 dropped drastically 60.91 to 52.08. Therefore we will retain water in our model.

Strength = \(\beta_0\) + \(\beta_1\) * cement + \(\beta_2\) * blast + \(\beta_3\) * fly + \(\beta_4\) * water + \(\beta_5\) * age

3.2)Goal: Rebuild the model after removing the insignificant variables.

R code:

full_model_v1 <- lm(strength ~ cement + blast + fly + water + age);
summary(full_model_v1)$adj.r.squared;
## [1] 0.6090761
summary(full_model)$adj.r.squared;
## [1] 0.6125073

Observation:

There is no significant change in Adjusted R2 when the fine, coarse aggregate and superplast variables are removed.

Chapter 4: Model Checking

4) Check if there is any pattern in residual plot and also for its normal distribution.

R Code:

par(mfrow=c(2,2))
plot(full_model_v1);

qqnorm(full_model_v1$residuals, main = "residuals Probability plot")
qqline(full_model_v1$residuals)
plot(cement, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');
plot(blast, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');
plot(fly, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');

plot(water, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');
plot(superplast, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');
plot(age, full_model_v1$residuals);
abline(h=0, lwd=2, col='red');

Observation:

There is very small pattern observed in fitted vs residual plot. If we check the plot for each individual variables against the residuals, there is a pattern for age vs residual. Hence we can apply transformation to age to get the more best fitted model.

Chapter 5: Remodelling and Model Checking

5) Goal: Applying Transformations: check what type of transformation best suits the age variable

Taken the inverse, log and square of the age variable and plotted them against the residuals

R code:

age_inv <- 1/(age);
full_model_v1_age_inv <- lm(strength ~ cement + blast + fly + water + superplast + age_inv);
summary(full_model_v1_age_inv)$adj.r.squared;
## [1] 0.7311373
age_log <- log(age);
full_model_v1_age_log <- lm(strength ~ cement + blast + fly + water + superplast + age_log);
summary(full_model_v1_age_log)$adj.r.squared;
## [1] 0.8130099
age_sq <- (age)^2;
full_model_v1_age_sq <- lm(strength ~ cement + blast + fly + water + superplast + age_sq);
summary(full_model_v1_age_sq)$adj.r.squared;
## [1] 0.5015163

Observation:

With the log transformation, the adjusted R2 is increased from 61% to 81.3%. And also the the p-value of the superplast variable is more than 0.05 and makes it as insginificant variable.

Thus, we can remove that superplast variable and rebuild the model and can check the adj R2. If there is no change, we can go ahead with this new model and will check for the residual plot pattern.

5.1) Goal: Rebuild the model and confirm whether we can remove superplast variable

R Code:

full_model_v2 <- lm(strength ~ cement + blast + fly + water + age_log);
summary(full_model_v2);
## 
## Call:
## lm(formula = strength ~ cement + blast + fly + water + age_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1457  -4.4504  -0.1221   4.3439  29.4486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.246249   2.541682   5.212 2.26e-07 ***
## cement       0.110539   0.002719  40.648  < 2e-16 ***
## blast        0.086829   0.003126  27.774  < 2e-16 ***
## fly          0.062713   0.004637  13.523  < 2e-16 ***
## water       -0.252034   0.011350 -22.205  < 2e-16 ***
## age_log      8.668383   0.191823  45.190  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.223 on 1024 degrees of freedom
## Multiple R-squared:  0.814,  Adjusted R-squared:  0.8131 
## F-statistic: 896.2 on 5 and 1024 DF,  p-value: < 2.2e-16
anova(full_model_v2, full_model_v1_age_log);
## Analysis of Variance Table
## 
## Model 1: strength ~ cement + blast + fly + water + age_log
## Model 2: strength ~ cement + blast + fly + water + superplast + age_log
##   Res.Df   RSS Df Sum of Sq      F Pr(>F)
## 1   1024 53418                           
## 2   1023 53386  1     32.22 0.6174 0.4322

Observation:

The Adjusted R2 remains unchanged. So we can remove the superplast and now all the variables are significant which can be identified from the p-values. Also from the the Anova, we can conclude that superplast variable is not significant.

5.2)Goal: Plot again the residual plots and check if there is any pattern

R Code:

par(mfrow=c(2,2))
plot(full_model_v2);

qqnorm(full_model_v2$residuals, main = "residuals Probability plot")
qqline(full_model_v2$residuals)
plot(cement, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');
plot(blast, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');
plot(fly, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');

plot(water, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');
plot(superplast, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');
plot(age_log, full_model_v2$residuals);
abline(h=0, lwd=2, col='red');

Observation:

If we see the fitted vs residual plot, we can see that there is no identified pattern. Also from the normal probability curve of the residuals, we can see that they are maximum normally distributed.

Linear Equation: strength = 13.246249 + 0.110539 * cement + 0.08629 * blast + 0.062713 * fly - 0.252034 * water + 8.668383 * age_log;

Chapter 6: Model Adequacy Checking

6)Goal: Check if there is any multicollinearity

R Code:

library(car);
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
vif(full_model_v2);
##   cement    blast      fly    water  age_log 
## 1.593145 1.435087 1.737411 1.158828 1.030339

Observation:

None of the value is > 5. So we can conclude that our model is not suffering from multicollinearity

6.1)Goal: Check for extrapolation, leverage, standardized & press residual and also influence points. Deduce if there are any outliers

R Code:

#standardized residuals
SSRes=sum((full_model_v2$residuals-mean(full_model_v2$residuals))^2);
MSRes=SSRes/(1030-3);
standardized_res=full_model_v2$residuals/sqrt(MSRes);
# PRESS residuals
PRESS_res=full_model_v2$residuals/(1 - lm.influence(full_model_v2)$hat);
par(mfrow=c(2,2));
plot(full_model_v2$fitted.values,full_model_v2$residuals,pch=20,ylab="residual",xlab="fitted value");
abline(h=0,col="grey")
plot(full_model_v2$fitted.values,standardized_res,pch=20,ylab="standardized residual",xlab="fitted value");
abline(h=0,col="grey");
plot(full_model_v2$fitted.values,PRESS_res,pch=20,ylab="PRESS residual",xlab="fitted value");
abline(h=0,col="grey");
plot(full_model_v2$fitted.values,lm.influence(full_model_v2)$hat,pch=20,ylab="leverage",xlab="fitted value");

which(lm.influence(full_model_v2)$hat > 2 *(5/1030)); #2(p/n)
##    4    7   10   13   18   20   21   25   26   27   30   34   35   36   39 
##    4    7   10   13   18   20   21   25   26   27   30   34   35   36   39 
##   40   42   43   46   49   57   64   67   75   77   80   85   91   98  103 
##   40   42   43   46   49   57   64   67   75   77   80   85   91   98  103 
##  121  126  144  146  149  167  169  172  182  225  226  227  228  229  500 
##  121  126  144  146  149  167  169  172  182  225  226  227  228  229  500 
##  501  502  503  504  505  506  507  509  513  554  560  572  585  611  646 
##  501  502  503  504  505  506  507  509  513  554  560  572  585  611  646 
##  653  667  689  700  747  748  756  757  764  799  800  816  821  863  874 
##  653  667  689  700  747  748  756  757  764  799  800  816  821  863  874 
##  891  909  924  927  928  929  931  933  934  935  936  937  954  972 1020 
##  891  909  924  927  928  929  931  933  934  935  936  937  954  972 1020
which(standardized_res > 3 | standardized_res < -3);
##   9  15 115 382 384 
##   9  15 115 382 384
which(cooks.distance(full_model_v2) > 1);
## named integer(0)

Observation:

To check the outliers, we identified the indexes that has high leverage points hii > 2p/n. And we checked whethere these indexes, has large residual standardized residual > |3|. None has satisfied that.

Also we confirmed from the Cook’s D to measure the influece as which index has > 1. We got nothing. So we can confirm that our model is not suffering from outliers.

Chapter 7: Finalizing the model

7)Goal: Build the confidence interval, predict the values and plot it

Strength = 13.246249 + 0.110539 * cement + 0.08629 * blast + 0.062713 * fly - 0.252034 * water + 8.668383 * age_log;

R Code:

##                   2.5 %      97.5 %
## (Intercept)  8.25874816 18.23375017
## cement       0.10520248  0.11587490
## blast        0.08069410  0.09296313
## fly          0.05361349  0.07181341
## water       -0.27430650 -0.22976100
## age_log      8.29197172  9.04479356
##  [1] 60.99250 60.99250 53.43901 56.05232 49.32786 34.09038 57.17854
##  [8] 34.92080 23.96913 37.17326
## Warning in predict.lm(full_model_v2, interval = "prediction"): predictions on current data refer to _future_ responses
##  [1] 60.99250 60.99250 53.43901 56.05232 49.32786 34.09038 57.17854
##  [8] 34.92080 23.96913 37.17326

Observation:

We got the confidence interval range that the true value can come in that range with 95% confidence. Also plotted the confidence and the prediction ranges.

Ref:Modeling of strength of high performance concrete using artificial neural networks, Cement and Concrete Research, Vol. 28, pp. 1797-1808, by I-C Yeh (1998). Video from Professor Nick Buenfeld, Head of Department at Imperial College Department of Civil and Environmental Engineering.