GAMs are effectively a nonparametric form of regression where the \(\beta x_i\) of a linear regression is replaced by a smooth function of the explanatory variables \(f(x_i)\), so:
\(y_i = f(x_i) + \epsilon_i\)
GAMs
\(E[y_i] = \beta_0 + f(x_i)\)
where \(f\) is the smooth function
\(y_i \sim DistExpoFam(E[y_i],...)\)
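For reference, a sketch of how the general form is usually written when there is a link function and more than one explanatory variable. The symbols \(g\), \(f_1, f_2\), \(b_j\) and \(K\) are not in the slides and are introduced here only for illustration; with an identity link and a single smooth this reduces to the equation above.

\[
g\left(E[y_i]\right) = \beta_0 + f_1(x_{1i}) + f_2(x_{2i}) + \dots, \qquad f(x) = \sum_{j=1}^{K} \beta_j \, b_j(x)
\]

Each smooth \(f\) is built as a weighted sum of \(K\) known basis functions \(b_j\), so fitting a GAM still comes down to estimating coefficients \(\beta_j\).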
How to run?
library(mgcv)
gam_model <- gam(Sources ~ s(SampleDepth), data = isit2)
summary(gam_model)
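The `isit2` data are not included here, so the following is only a minimal sketch on simulated data. The data frame `sim` and its columns `depth` and `sources` are hypothetical stand-ins for `SampleDepth` and `Sources`, and the bump-shaped curve is an assumption, not the real pattern.

```r
library(mgcv)

set.seed(1)
# Hypothetical stand-in for isit2: a bump-shaped signal plus noise
sim <- data.frame(depth = runif(200, 500, 5000))
sim$sources <- 20 * exp(-((sim$depth - 1500) / 600)^2) + rnorm(200, sd = 2)

# One smooth term, default Gaussian family
gam_sim <- gam(sources ~ s(depth), data = sim)

# The "edf" column of the smooth-terms table reports how wiggly the fit is
summary(gam_sim)
```

The same three steps (load `mgcv`, call `gam()` with an `s()` term, inspect `summary()`) apply to the real data.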
How can we assess the fit of a generalized linear model (GLM) or generalized additive model (GAM)?
When to run a GAM vs a GLM?
“The key difference is that the linear predictor now incorporates smooth functions of at least some (possibly all) features, represented as f(x)…” What are these features? Are they the variables in your model?
What does an actual equation in the GAM formulation look like, instead of just \(g(\mu) = f(X)\)?
How do generalized additive models (GAMs) provide a more flexible approach?
If a GAM can handle linear and non-linear relationships, why not run a GAM first and pivot to a GLM if necessary?
How is scatterplot smoothing not just overfitting the data?
It seems like a GAM is just a GLM with a transformation to help smooth out the data. Could you not just transform the data yourself in the GLM? Would the GAM transform the data in a way that is not appropriate for everyone’s analysis?
What does “smooth functions of features” mean and how is it different from linear features?
How do we select the number of splines or knots?
Is there some approach to model selection to approximate an appropriate basis function from the basis space?
Season 2
How to plot this?
library(ggplot2)
gam_model <- gam(Sources ~ s(SampleDepth), data = isit2)
ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
  geom_point() +
  geom_line(aes(y = fitted(gam_model)), colour = "blue", size = 1.2) +
  theme_classic()
Season 1
Do the same thing for season #1
Run the GAM and compare it to a linear model
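One way to make the comparison, sketched on simulated data because `isit2` is not available here. The data frame `sim` and its columns are hypothetical, and using AIC as the yardstick is an assumption about how you might want to compare the two fits.

```r
library(mgcv)

set.seed(2)
# Hypothetical data with a clearly non-linear signal
sim <- data.frame(depth = runif(200, 500, 5000))
sim$sources <- 20 * exp(-((sim$depth - 1500) / 600)^2) + rnorm(200, sd = 2)

lin_fit <- lm(sources ~ depth, data = sim)                       # straight line
gam_fit <- gam(sources ~ s(depth), data = sim, method = "REML")  # smooth term

# Lower AIC = better trade-off between fit and complexity
AIC(lin_fit, gam_fit)
```

If the smooth ends up nearly linear, the two AIC values will be close and the simpler linear model is the easier one to interpret.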
Are they overfitted?
Some of you said yes!
Some of you said, how is this different than overfitting a linear model?
What is overfitting?
Discussion
How can we avoid overfitting in GLMs and GAMs?
How is scatterplot smoothing not just overfitting the data?
Overfitting
What is overfitting?
What if the data are “wiggly” because the underlying hidden process (the population parameter) IS wiggly?
Remember… it is all about the underlying real process (AKA real population parameter)
This has 5 knots… but that’s not usually how it’s done
We (or the package) control the model’s smoothness by adding a “wiggliness” penalty
How can we choose the number of knots? And where to place them?
Number of knots
How to choose where and how many knots?
Is there a good way?
We use a term \(\lambda\) to penalize wiggliness (Monday!). There are many ways to estimate it… but what is important is that we are NOT TRYING TO MINIMIZE residuals alone!
More knots potentially means more ‘wiggliness’
Decide number of knots
The odds of you knowing beforehand the number of knots and where to put them are somewhere between slim and none
We could do it by hand… or let the package do it. There are many ways that “penalties” can get decided. I would not estimate them by hand
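A small sketch of what “letting the package do it” can look like in `mgcv`, on simulated data. The data frame `d`, the choice `k = 20`, and the extreme `sp` values are all illustrative assumptions: `k` sets an upper limit on flexibility, and the smoothing parameter (the penalty weight) decides how much of it is used.

```r
library(mgcv)

set.seed(3)
# Hypothetical data: one sine cycle plus noise
d <- data.frame(x = seq(0, 1, length.out = 200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

# Penalty estimated from the data by REML
fit_auto  <- gam(y ~ s(x, k = 20), data = d, method = "REML")
# Penalty fixed by hand: very large (stiff) and very small (loose)
fit_stiff <- gam(y ~ s(x, k = 20), data = d, sp = 1e8)
fit_loose <- gam(y ~ s(x, k = 20), data = d, sp = 1e-8)

# Effective degrees of freedom of the smooth under each penalty
edf_auto  <- summary(fit_auto)$edf
edf_stiff <- summary(fit_stiff)$edf
edf_loose <- summary(fit_loose)$edf
c(auto = edf_auto, stiff = edf_stiff, loose = edf_loose)

# Rough check of whether k was set high enough
k.check(fit_auto)
```

The hand-fixed fits are there only to show the two extremes; in practice the estimated penalty is the one to use.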
Number of knots and penalties
Penalize wiggliness, which limits how much of the flexibility offered by the basis and the number of knots the fit actually uses
The penalty concerns the complexity of the model, specifically the size of the coefficients for the smooth terms. In practice, it helps keep us from overfitting the data, where our smooth function might get too wiggly.
Balance between minimizing “residuals” and wiggliness
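As a sketch, for a Gaussian response and a single smooth this balance is commonly written as a penalized least-squares criterion. The second-derivative form of the penalty is an assumption here (it is the usual choice for spline smoothers, but other penalties exist).

\[
\underbrace{\sum_{i=1}^{n} \left(y_i - \beta_0 - f(x_i)\right)^2}_{\text{residuals}} \; + \; \lambda \underbrace{\int f''(x)^2 \, dx}_{\text{wiggliness}}
\]

As \(\lambda \to 0\) the fit chases the data; as \(\lambda \to \infty\) any curvature is penalized away and \(f\) approaches a straight line.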
Essentially, a higher EDF (effective degrees of freedom) implies a more complex, wigglier spline
EDF close to 1 → close to a linear term (probably better to run a linear model)!
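To see the EDF behave this way, a minimal sketch with two simulated responses on the same `x`: one truly linear, one truly wavy. The data frame `d` and both signals are hypothetical.

```r
library(mgcv)

set.seed(4)
d <- data.frame(x = runif(300))
d$y_line <- 1 + 2 * d$x + rnorm(300, sd = 0.3)        # straight-line truth
d$y_wave <- sin(4 * pi * d$x) + rnorm(300, sd = 0.3)  # wiggly truth

fit_line <- gam(y_line ~ s(x), data = d, method = "REML")
fit_wave <- gam(y_wave ~ s(x), data = d, method = "REML")

# EDF of the smooth term in each model
edf_line <- summary(fit_line)$edf
edf_wave <- summary(fit_wave)$edf
c(line = edf_line, wave = edf_wave)
```

The penalty shrinks the smooth toward a line when the data give no reason to bend, so the EDF for `y_line` should come out much smaller than the one for `y_wave`.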
Can I do it myself?
It seems like a GAM is just a GLM with a transformation to help smooth out the data. Could you not just transform the data yourself in the GLM? Would the GAM transform the data in a way that is not appropriate for everyone’s analysis?
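One way to explore this question is to try it: pick a transformation by hand in a GLM and compare it with the smooth chosen by the GAM. The sketch below uses simulated data; the data frame `d`, the wavy signal, and the quadratic `poly(x, 2)` are all hypothetical choices made for illustration.

```r
library(mgcv)

set.seed(5)
d <- data.frame(x = runif(300))
d$y <- sin(4 * pi * d$x) + rnorm(300, sd = 0.3)

# "Do it yourself": we commit to a quadratic shape in advance
glm_quad <- glm(y ~ poly(x, 2), data = d)
# GAM: the shape is estimated, with a penalty against wiggliness
gam_fit  <- gam(y ~ s(x), data = d, method = "REML")

AIC(glm_quad, gam_fit)
```

The hand-picked transformation works well only if we guess the right shape. Note also that the GAM does not transform the data: it estimates a function of the explanatory variable.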