GAMs are effectively a nonparametric form of regression in which the \(\beta x_i\) term of a linear regression is replaced by a smooth function of the explanatory variable, \(f(x_i)\), so:
\(y_i = f(x_i) + \epsilon_i\)
GAMs
\(E[y_i] = \beta_0 + f(x_i)\)
where \(f\) is the smooth function
\(y_i \sim \text{ExpoFam}(E[y_i], \ldots)\), where ExpoFam is a distribution from the exponential family
How to run?
library(mgcv)
gam_model <- gam(Sources ~ s(SampleDepth), data = isit2)
summary(gam_model)
How can we avoid overfitting in GLMs and particularly GAMs?
If GAMs can deal with both linear and nonlinear data, why not just run a GAM instead of a GLM?
How can we assess the fit of a generalized linear model (GLM) or generalized additive model (GAM)?
How do I interpret the results of a generalized additive model, especially when we use smooth functions? (You will do this for the assignment, so we will not discuss it right now.)
What does an actual equation in the GAM formulation look like, instead of just \(g(\mu) = f(X)\)?
How is scatterplot smoothing not just overfitting the data?
It seems like a GAM is just a GLM with a transformation to help smooth out the data. Could you not just transform the data yourself in the GLM? Would the GAM transform the data in a way that is not appropriate for everyone’s analysis?
What does “smooth functions of features” mean and how is it different from linear features?
How do we select the number of splines or knots? How do we select basis functions?
Is there some approach to model selection to approximate an appropriate basis function from the basis space?
Discussion: today
If GAMs can deal with both linear and nonlinear data, why not just run a GAM instead of a GLM?
When to run a GAM vs a GLM?
How can we avoid overfitting in GLMs and particularly GAMs?
How is scatterplot smoothing not just overfitting the data?
Season 2
How to plot this?
Season 2
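library(mgcv)
library(ggplot2)
gam_model <- gam(Sources ~ s(SampleDepth), data = isit2)
# Raw observations as points, fitted GAM values as a blue line
ggplot(data = isit2, aes(y = Sources, x = SampleDepth)) +
  geom_point() +
  geom_line(aes(y = fitted(gam_model)), colour = "blue", size = 1.2) +
  theme_classic()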
Season 1
Do the same thing for season #1
Run the GAM and compare it to a linear model
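One way to set up this comparison is sketched below. The data are simulated stand-ins (not the course dataset), the names `sim`, `lm_fit`, and `gam_fit` are made up for illustration, and `method = "REML"` is an assumption rather than the setting used on the earlier slide.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data (NOT the course dataset): nonlinear signal plus noise
n <- 200
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.3)

lm_fit  <- lm(y ~ x, data = sim)                       # straight line
gam_fit <- gam(y ~ s(x), data = sim, method = "REML")  # penalized smooth

# Lower AIC suggests a better trade-off between fit and complexity
AIC(lm_fit, gam_fit)
```

With a clearly nonlinear signal like this one, the GAM should come out ahead; when the underlying process is close to linear, the two models tend to look very similar.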
Are they overfitted?
Some of you said yes!
Some of you asked how this is different from overfitting a linear model.
What is overfitting?
Discussion
How can we avoid overfitting in GLMs and GAMs?
How is scatterplot smoothing not just overfitting the data?
What is overfitting?
Overfitting
What if the data are “wiggly” because the underlying hidden process (the population parameter) IS wiggly?
Remember… it is all about the underlying real process (AKA the real population parameter)
Splines: These are the smooth, flexible functions used to approximate relationships between predictor variables and the response variable. They are piecewise polynomial functions that fit different segments of the data while ensuring smooth transitions between them. In GAMs, splines are commonly used to model non-linear relationships.
Knots: These are the specific points in the range of the predictor variable where the pieces of the spline function connect. The number and placement of knots determine the flexibility of the spline. More knots allow for more flexibility, but too many can lead to overfitting.
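The two definitions above can be illustrated with an unpenalized regression spline. This is a minimal sketch on simulated data using the base `splines` package rather than `mgcv`; the knot positions are hypothetical and chosen by hand only to show what a knot is.

```r
library(splines)

set.seed(2)
# Simulated data, for illustration only
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.3)

knots <- c(2.5, 5, 7.5)   # hypothetical interior knot positions

# Cubic B-spline basis: piecewise cubics joined smoothly at the knots
basis <- bs(x, knots = knots, degree = 3)
dim(basis)                # rows = observations, columns = basis functions

# Unpenalized regression spline: ordinary least squares on the basis columns
spline_fit <- lm(y ~ bs(x, knots = knots, degree = 3))
summary(spline_fit)$r.squared
```

Adding more entries to `knots` adds more basis columns, which is exactly the extra flexibility (and the extra risk of overfitting) described above.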
Piecewise polynomial?
Smoother than the piecewise and CONNECTED!
How many knots?
How do we select number of splines or knots?
This has 5 knots… but that is not usually how it is done
We (or the package) control the model’s smoothness by adding a “wiggliness” penalty
How can we choose the number of knots? And where to place them
Number of knots
How to choose where and how many knots?
Is there a good way?
We use a term \(\lambda\) to penalize wiggliness (Monday!). There are many ways to estimate it… but what is important is that we are NOT TRYING TO MINIMIZE the residuals alone!
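For a single smooth and a Gaussian response, the criterion being minimized is commonly written as a penalized sum of squares. The form below is the standard textbook version, given here as an assumption about what the slide has in mind, with the integrated squared second derivative as the measure of wiggliness:

```latex
\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - \beta_0 - f(x_i) \bigr)^2 \;+\; \lambda \int f''(x)^2 \, dx
```

When \(\lambda\) is close to 0 the penalty vanishes and the curve is free to chase the data; as \(\lambda\) grows, curvature becomes expensive and \(f\) is pulled toward a straight line.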
More knots potentially means more ‘wiggliness’
Decide number of knots
The odds of you knowing beforehand the number of knots and where to put them are somewhere between slim and none
We could do it by hand… or let the package do it. There are many ways that “penalties” can get decided. I would not estimate them by hand
Number of knots and penalties
Penalize for wiggliness which affects the base polynomial and the number of knots
The penalty targets the complexity of the model, specifically the size of the coefficients of the smooth terms. In practice, it helps keep us from overfitting the data, where the smooth function would get too wiggly.
Balance between minimizing “residuals” and wiggliness
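The balance can be seen by fixing the smoothing parameter by hand through the `sp` argument of `gam()`. This is a sketch on simulated data, for illustration only; in practice the package estimates the penalty, as in `fit_auto`.

```r
library(mgcv)

set.seed(3)
# Simulated data, for illustration only
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

# Fix the smoothing parameter by hand through sp
fit_wiggly <- gam(y ~ s(x, k = 20), sp = 1e-6)        # almost no penalty
fit_stiff  <- gam(y ~ s(x, k = 20), sp = 1e6)         # very heavy penalty
fit_auto   <- gam(y ~ s(x, k = 20), method = "REML")  # penalty estimated

# Effective degrees of freedom of the smooth under each penalty
edf <- c(wiggly = summary(fit_wiggly)$edf,
         stiff  = summary(fit_stiff)$edf,
         auto   = summary(fit_auto)$edf)
edf
```

The nearly unpenalized fit uses almost all of the available basis functions, the heavily penalized fit collapses toward a straight line, and the estimated penalty lands somewhere in between.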
Essentially, a higher EDF (effective degrees of freedom) implies a more complex, wigglier spline
EDF close to 1 –> close to a linear term (probably better to run a linear model)!
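A quick way to see this is to fit the same smooth to a truly linear process and to a truly wiggly one. The sketch below uses simulated data and assumes `method = "REML"`; the variable names are made up for illustration.

```r
library(mgcv)

set.seed(4)
# Simulated data, for illustration only
n <- 200
x <- runif(n, 0, 10)
y_linear <- 1 + 0.5 * x + rnorm(n, sd = 0.3)   # truly linear process
y_curved <- sin(x) + rnorm(n, sd = 0.3)        # truly wiggly process

fit_linear <- gam(y_linear ~ s(x), method = "REML")
fit_curved <- gam(y_curved ~ s(x), method = "REML")

# EDF of each smooth; the same value appears in the "edf" column of summary()
c(linear = summary(fit_linear)$edf, curved = summary(fit_curved)$edf)
```

The smooth fitted to the linear process should be penalized down to an EDF near 1, while the wiggly process keeps a clearly larger EDF.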
Can I do it myself?
It seems like a GAM is just a GLM with a transformation to help smooth out the data. Could you not just transform the data yourself in the GLM? Would the GAM transform the data in a way that is not appropriate for everyone’s analysis?
Run a model with Season (season should be a factor!), Sample Depth (smooth), and relative depth (linear) and explore it
Analysis of variance (ANOVA)
anova(model_seasonsrel)

Family: gaussian
Link function: identity

Formula:
Sources ~ Season + s(SampleDepth) + RelativeDepth

Parametric Terms:
              df     F  p-value
Season         1 160.6  < 2e-16
RelativeDepth  1  22.3 2.76e-06

Approximate significance of smooth terms:
                 edf Ref.df     F p-value
s(SampleDepth) 8.699  8.974 132.5  <2e-16
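A model with this formula could be set up roughly as follows. The data frame `sim` is simulated (it is not the course dataset, and its numbers will not match the output above); only the column names and the model name mirror the slide, and `method = "REML"` is an assumption.

```r
library(mgcv)

set.seed(5)
n <- 300
# Simulated stand-in for the course data; column names mirror the slide
sim <- data.frame(
  Season        = factor(sample(1:2, n, replace = TRUE)),  # must be a factor
  SampleDepth   = runif(n, 500, 5000),
  RelativeDepth = runif(n, 0, 1)
)
sim$Sources <- 10 + 5 * (sim$Season == "2") +
  8 * sin(sim$SampleDepth / 700) - 4 * sim$RelativeDepth +
  rnorm(n, sd = 2)

# Factor term + smooth term + linear term
model_seasonsrel <- gam(Sources ~ Season + s(SampleDepth) + RelativeDepth,
                        data = sim, method = "REML")

anova(model_seasonsrel)    # tests for parametric and smooth terms
summary(model_seasonsrel)  # coefficients, edf, deviance explained
```

The parametric rows (Season, RelativeDepth) are read like ordinary linear-model terms, while the smooth row reports the EDF and an approximate test of whether the smooth is needed at all.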
Interpreting a GAM
GAM vs GLM debate. What is easier to interpret?
I personally would not run a GAM unless absolutely necessary!
How can I modify the smoothness?
The smoothness is chosen using a cross-validation (prediction error) or likelihood-based method.
You generally won’t be selecting it (unless you do it by hand and develop your own optimization technique).
The mgcv package offers multiple methods for smoothness selection.
How can I modify the smoothness?
I tried all of them on multiple datasets and the differences are minuscule. I prefer REML (restricted maximum likelihood), which is likelihood-based; GCV is the cross-validation-based alternative.
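The comparison can be reproduced on any dataset by looping over the `method` argument of `gam()`. This sketch uses simulated data and three of the available criteria; the names `criteria` and `fits` are made up for illustration.

```r
library(mgcv)

set.seed(6)
# Simulated data, for illustration only
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

criteria <- c("GCV.Cp", "REML", "ML")
fits <- lapply(criteria, function(m) gam(y ~ s(x), method = m))
names(fits) <- criteria

# EDF and estimated smoothing parameter selected by each criterion
sapply(fits, function(f) c(edf = unname(summary(f)$edf),
                           sp  = unname(f$sp)))
```

The smoothing parameters can differ noticeably on their own scale while the resulting EDF and fitted curves stay close, which is consistent with the small differences mentioned above.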
How about the basis dimensions?
We can choose these (the k argument of s()) and actually do model selection! The package uses a default value if we do not set one. Increasing it gives us “more flexibility” and allows a higher edf (assignment)
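As a sketch of what choosing the basis dimension looks like, the example below fits the same simulated signal with a small and a large `k` and compares them by AIC. The data and the specific `k` values are assumptions chosen only to make the contrast visible.

```r
library(mgcv)

set.seed(7)
# Simulated data with a fairly wiggly signal, for illustration only
n <- 300
x <- runif(n, 0, 10)
y <- sin(2 * x) + rnorm(n, sd = 0.3)

fit_k5  <- gam(y ~ s(x, k = 5),  method = "REML")  # small basis
fit_k20 <- gam(y ~ s(x, k = 20), method = "REML")  # larger basis

# The EDF of a smooth cannot exceed k - 1
c(k5 = summary(fit_k5)$edf, k20 = summary(fit_k20)$edf)

# Model selection across basis dimensions, here by AIC
AIC(fit_k5, fit_k20)
```

Here `k` only sets an upper limit on flexibility; the penalty still decides how much of that flexibility is actually used, so a generous `k` is usually less risky than one that is too small.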
More about the gam function
As always… you should read the documentation of the package and the help file!
You can specify a “family” and can run Poisson, Binomial, etc.
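For example, a count response could be modelled with a Poisson family. This is a minimal sketch on simulated counts; the object names are made up, and the log link shown in the comment is the default for `family = poisson`.

```r
library(mgcv)

set.seed(8)
# Simulated count data, for illustration only
n <- 300
x <- runif(n, 0, 10)
counts <- rpois(n, lambda = exp(1 + sin(x)))

# Poisson GAM with the default log link: log(E[y]) = beta_0 + f(x)
pois_gam <- gam(counts ~ s(x), family = poisson, method = "REML")
summary(pois_gam)

# Predictions on the response (count) scale
new_x <- data.frame(x = seq(0, 10, length.out = 50))
pred  <- predict(pois_gam, newdata = new_x, type = "response")
head(pred)
```

Swapping `family = poisson` for `family = binomial` would give the GAM analogue of logistic regression for a 0/1 response.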
Assignment
I uploaded a GAM assignment to Canvas.
Announcements
Sorry about the grading!
I teach 3 (4) classes this semester and have been drowning in work. Will get to it!
Grades are secondary, but I want to give you good feedback.
Final Exam
Let’s talk about it
Other topics
Bayesian statistics. We will start talking about it.