Generalized additive models (GAMs) allow us to capture relationships
between predictors and outcome variables, using non-linear smooth
functions with varying shapes. GAMs capture complex relationships much
more easily than linear regression models, while still being relatively
easy to interpret and plot. GAMs can also derive the importance of the
predictors from the data, pick smoothing parameters for them, and apply
shrinkage to select out unimportant predictors.
In this analysis, we have a dataset
of concrete compressive strength measurements, and we will try to
predict compressive strength using the quantities of different
components.
Let’s load and view our original dataset.
| Concrete strength original dataset | |||||||||
|---|---|---|---|---|---|---|---|---|---|
| cement | slag | flyash | water | superplasticizer | coarseaggregate | fineaggregate | age | csMPa | |
| 1 | 540.0 | 0.0 | 0 | 162 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 2 | 540.0 | 0.0 | 0 | 162 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 3 | 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 4 | 332.5 | 142.5 | 0 | 228 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 5 | 198.6 | 132.4 | 0 | 192 | 0.0 | 978.4 | 825.5 | 360 | 44.30 |
All variables are numeric. All columns except the last two are concrete components, in units of kilograms per m^3 of mixture. Below is the description of each column.
Besides the component variables, we have the age of the concrete
mixture in days, and concrete compressive strength, our outcome
variable, in units of megapascals (MPa).
For ease of coding and plotting, we will rename some of the variables.
We will abbreviate superplasticizer, coarse aggregate and fine
aggregate, while renaming the outcome variable to strength.
colnames(df_org)[5:7] <- c("SP", "CA", "FA")
colnames(df_org)[9] <- "strength"
Here is some basic information about concrete components:
This information likely means that we should consider the total
amounts of each component category as our main predictor variables. We
should also consider the balances between them, as well as the balance
of individual components in each component category. For these reasons,
we will calculate some new variables.
| Concrete strength dataset, with new variables | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cement | slag | flyash | total_cem | water | SP | total_water | CA | FA | total_agg | age | wc_ratio | strength | |
| 1 | 540.0 | 0.0 | 0 | 540 | 162 | 2.5 | 164.5 | 1040.0 | 676.0 | 1716.0 | 28 | 0.3046 | 79.99 |
| 2 | 540.0 | 0.0 | 0 | 540 | 162 | 2.5 | 164.5 | 1055.0 | 676.0 | 1731.0 | 28 | 0.3046 | 61.89 |
| 3 | 332.5 | 142.5 | 0 | 475 | 228 | 0.0 | 228.0 | 932.0 | 594.0 | 1526.0 | 270 | 0.4800 | 40.27 |
| 4 | 332.5 | 142.5 | 0 | 475 | 228 | 0.0 | 228.0 | 932.0 | 594.0 | 1526.0 | 365 | 0.4800 | 41.05 |
| 5 | 198.6 | 132.4 | 0 | 331 | 192 | 0.0 | 192.0 | 978.4 | 825.5 | 1803.9 | 360 | 0.5801 | 44.30 |
Our new variables are the following:
Let’s explore the distributions, correlations and relationships in our dataset, considering both the original variables, and our calculated variables.
Let’s start with the distributions in our dataset.
Let’s check the correlations between our original variables.
Correlations with the outcome variable, compressive strength:
Correlations between the predictor variables:
As we can see, the original variables don’t display a particularly
strong (linear) relationship with strength, and the shapes of the
relationships are not easy to interpret. Let’s see the correlations of
our calculated variables.
Correlations with the outcome variable, compressive strength:
Correlations among our new predictors:
These relationships suggest the amount of a certain component type is constrained by the amounts of other types, as we would expect.
To better visualize the relationships of our components with
strength, especially depending on the amounts of other components, we
can use 3D scatterplots.
Again, a lower total water content generally leads to stronger mixes,
between the ranges of 140-200, while total aggregate content doesn’t
have much effect besides the extreme low and high ends.
We saw the strength of concrete depends on numerous components, and
viable mixtures likely have different tradeoffs between certain
components. It’s very likely that the effects of our predictors are
mediated by the values of other predictors, i.e, there is interaction
between the predictors.
With numeric variables, we can use conditional plots to check for
interactions. In the plots below, the relationship of one predictor with
strength is plotted several times, each for a given range of another
predictor. If the relationship between predictor one and strength
changes based on the range of predictor two, we can say there is a
considerable interaction between the two predictors. Let’s start by
plotting the interactions of our cement-type components with the total
amounts of cement-type components.
Each range of the given predictor, symbolized by the gray bars above,
corresponds with each scatterplot of the relationship of the other
predictor and the outcome.The total amount of cement-type components
clearly has a bigger effect on mixture strength when it’s made up of
more cement. This suggests a certain amount of regular cement is
necessary for the strongest mixes.
Different amounts of slag do not impact the effect of total cement by a
lot, but there are some notable differences.
The amount of fly ash does not greatly impact the effect of total cement
either.
Water levels greatly impact the relationship between total water-type
components and strength.
SP levels also have a big impact on the relationship between total
water-type components and strength.
The CA levels greatly affect the relationship between total aggregate
content and strength.
FA levels also have a big impact on the relationship between total
aggregate content and strength.
The total amounts of water-type components don’t seem to affect the
relationship between total amounts of cement-type components and
strength by much. This is a surprising result, though there are some
slight differences.
The total aggregate content affects the relationship between total
cement and strength, but again, not by much.
The total aggregate content affects the relationship between total water
content and strength considerably.
Now let’s fit our GAMs to predict the concrete compressive strength, as a function of the mixture components and age.
Let’s fit our first model, gam1, which will consist of:
GAMs can be specified with many options, and we will use the following for our model:
Let’s fit our model, and view the model summary.
gam1 <- gam(strength ~
te(total_cem, total_water, total_agg) +
ti(total_cem, cement, slag, flyash) +
ti(total_water, water, SP) +
ti(total_agg, CA, FA) +
s(age),
data=df_org, method="REML", select=TRUE)
summary(gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## strength ~ te(total_cem, total_water, total_agg) + ti(total_cem,
## cement, slag, flyash) + ti(total_water, water, SP) + ti(total_agg,
## CA, FA) + s(age)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.783 1.148 31.18 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## te(total_cem,total_water,total_agg) 29.126 122 3.867 <0.0000000000000002
## ti(total_cem,cement,slag,flyash) 58.228 179 2.676 <0.0000000000000002
## ti(total_water,water,SP) 14.701 63 1.175 <0.0000000000000002
## ti(total_agg,CA,FA) 19.670 64 1.166 <0.0000000000000002
## s(age) 8.301 9 429.469 <0.0000000000000002
##
## te(total_cem,total_water,total_agg) ***
## ti(total_cem,cement,slag,flyash) ***
## ti(total_water,water,SP) ***
## ti(total_agg,CA,FA) ***
## s(age) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.914 Deviance explained = 92.5%
## -REML = 3322.9 Scale est. = 23.965 n = 1030
The output is somewhat similar to a linear regression output, though there are no coefficients displayed for model terms, as each term has not one but numerous smooth functions that predict its effect.
Let’s plot the predicted values of concrete strength against the
observed values in the data.
The model fits the data well overall, though there are a few
observations with considerably inaccurate predictions. The model
predicts zero, close to zero or even negative strength for a few
low-strength mixtures. This could be an indication our model places a
little more importance on our predictors than they actually have.
Let’s check if our model fits the assumptions, which is done in a very
similar way to linear regression.
##
## Method: REML Optimizer: outer newton
## full convergence after 17 iterations.
## Gradient range [-0.00001617009,0.0001075998]
## (score 3322.919 & scale 23.96492).
## eigenvalue range [-0.00009809676,515.2561].
## Model rank = 441 / 441
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## te(total_cem,total_water,total_agg) 124.0 29.1 0.85 <0.0000000000000002 ***
## ti(total_cem,cement,slag,flyash) 179.0 58.2 1.01 0.66
## ti(total_water,water,SP) 64.0 14.7 0.89 <0.0000000000000002 ***
## ti(total_agg,CA,FA) 64.0 19.7 0.90 <0.0000000000000002 ***
## s(age) 9.0 8.3 0.77 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One more thing we have to diagnose is the presence of concurvity in the
model. Concurvity can be thought of as “multicollinearity for GAM”. The
presence of strong concurvity between two model terms means one can be
approximated from the smooths of another.
## para te(total_cem,total_water,total_agg)
## worst 1 1.0000
## observed 1 0.9995
## estimate 1 0.9923
## ti(total_cem,cement,slag,flyash) ti(total_water,water,SP)
## worst 1.0000 1.0000
## observed 0.9538 0.9998
## estimate 0.8944 0.9991
## ti(total_agg,CA,FA) s(age)
## worst 1.0000 0.5259
## observed 0.9997 0.1297
## estimate 0.9858 0.3490
Above, we have the overall concurvity of each model term with the other terms, 0 being no concurvity and 1 being highest. All model terms except age have very high overall concurvity, meaning they can be strongly approximated using other model terms.
Of course, one of the main benefits of GAMs is the ability to plot the smooths fitted for model terms, visualizing the complex, non-linear relationships between predictors and the outcome. We will do this later using 3D plots, but for now, let’s fit a simpler, more robust model that doesn’t suffer from concurvity.
With gam1, we fit a complex model with many interactions between the components. Now, let’s go the other way, and fit a very simple model, and see how it performs compared to our first model. We will simply predict concrete strength using the main effects of the water/cement ratio, and concrete age.
gam2 <- gam(strength ~
s(wc_ratio) +
s(age),
data=df_org, method="REML", select=TRUE)
summary(gam2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## strength ~ s(wc_ratio) + s(age)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.8180 0.2338 153.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(wc_ratio) 7.318 9 250.9 <0.0000000000000002 ***
## s(age) 7.660 9 211.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.798 Deviance explained = 80.1%
## -REML = 3575.3 Scale est. = 56.307 n = 1030
This model certainly takes much less time to compute.
Let’s plot the predicted vs. observed values, together with those from
gam1, and compare the fits.
While gam2’s predictions perform well, the overall error is considerably
higher than gam1’s predictions.
Let’s diagnose gam2 the same way we diagnosed gam1.
##
## Method: REML Optimizer: outer newton
## full convergence after 14 iterations.
## Gradient range [-0.001865326,0.001065966]
## (score 3575.325 & scale 56.30727).
## eigenvalue range [-0.0001923882,514.5508].
## Model rank = 19 / 19
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(wc_ratio) 9.00 7.32 0.63 <0.0000000000000002 ***
## s(age) 9.00 7.66 0.70 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s check the concurvity of gam2 model terms, which should be much less of a problem compared to gam1.
## para s(wc_ratio) s(age)
## worst 0 0.0828 0.0828
## observed 0 0.0515 0.0145
## estimate 0 0.0438 0.0488
We see that concurvity is practically non-existent in gam2.
Finally, we can compare the two models’ AIC scores. A lower AIC score indicates a better tradeoff between model accuracy and complexity.
AIC(gam1, gam2)
## df AIC
## gam1 149.64565 6354.052
## gam2 18.06652 7094.791
Despite having many more degrees of freedom (higher complexity), gam1 still has a lower AIC score than gam2.
This also tells us a third model between gam1 and gam2 in terms of complexity is unlikely to perform much better. So let’s move on to plotting the relationships modeled by gam1 and gam2, using 3D plots.
We will now visualize the smooths fitted for the relationships
between our predictor variables, and our outcome variable, strength. We
could do this one by one with 2D line plots, but since there are
numerous significant interactions between our component variables,
plotting the relationships with 3D plots is likely to be more
insightful.
For the relationships between the total component variables and age, we
will use surface plots. We will have two predictors as the X and Y
variables, and their values will include their full ranges from our
dataset. We will predict a concrete strength value for every combination
of X and Y. These predictions for strength will make up our Z variable,
and the surface of our plots.
For the relationships between the total components and their sub-components, we will use scatterplots. The predictors outside of the plot will be held constant at their median values. Every combination of the sub-component variables will sum up to their total component variable, to ensure we don’t have unrealistic predictors.
Let’s start with our calculated total components, and their
relationships with concrete strength.
The above predictions are made by varying total_cem and total_water, while holding all other variables constant at their medians, except cement and water. Cement is calculated as total_cem - slag - flyash, while water is calculated as total_water - SP.
This time, we vary the total_cem and total_agg variables:
Varying total_water and total_agg, we see this duo’s balance is less impactful overall than the previous two groupings.
Now, let’s examine the balances of each sub-component, and their effects
on mixture strength, starting with cement-type components.