Generalized Additive Models

This article introduces Generalized Additive Models (GAMs) and shows how to fit them in R.

GAMs are a well-known and flexible technique for modelling non-linear relationships. Instead of using a single linear term per predictor, we add a separate non-linear function of each predictor to the regression equation. Each of these functions can be a cubic spline, a natural spline, a smoothing spline, or even a polynomial.

\[y_i = \alpha + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i\]

\[\text{where each } f_j \text{ is a (possibly non-linear) function of the predictor } x_j, \quad j = 1, \dots, p.\]
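
As a concrete instance, the first model fitted later in this article uses age, year and education to explain wage. Written in the general form above it would read as follows, where the names f_1, f_2 and f_3 are only labels chosen here for illustration:

\[\text{wage}_i = \alpha + f_1(\text{age}_i) + f_2(\text{year}_i) + f_3(\text{education}_i) + \epsilon_i\]

Here f_1 and f_2 are smooth functions of the quantitative predictors, while education is a factor, so f_3 simply takes one constant value per education level.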

We need the ‘gam’ package, which provides functions for fitting Generalized Additive Models.

# Load the gam package
library(gam)

# The ISLR package contains the 'Wage' dataset (Mid-Atlantic wage data)
library(ISLR)

# Every model below passes data = Wage explicitly, so attach() is not needed

?Wage # Open the help page to read more about the dataset

# s() specifies a smoothing-spline term inside gam(); df sets its degrees of freedom
gam1 <- gam(wage ~ s(age, df = 6) + s(year, df = 6) + education, data = Wage)
summary(gam1)

## 
## Call: gam(formula = wage ~ s(age, df = 6) + s(year, df = 6) + education, 
##     data = Wage)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -119.89  -19.73   -3.28   14.27  214.45 
## 
## (Dispersion Parameter for gaussian family taken to be 1235.516)
## 
##     Null Deviance: 5222086 on 2999 degrees of freedom
## Residual Deviance: 3685543 on 2983 degrees of freedom
## AIC: 29890.31 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(age, df = 6)     1  200717  200717 162.456 < 2.2e-16 ***
## s(year, df = 6)    1   22090   22090  17.879 2.425e-05 ***
## education          4 1069323  267331 216.372 < 2.2e-16 ***
## Residuals       2983 3685543    1236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                 Npar Df  Npar F  Pr(F)    
## (Intercept)                               
## s(age, df = 6)        5 26.2089 <2e-16 ***
## s(year, df = 6)       5  1.0144 0.4074    
## education                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
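
A few numbers in this summary can be checked by hand. The sketch below copies the deviances and degrees of freedom from the output above and recomputes the dispersion parameter, the degrees of freedom used by the model, and the share of deviance explained. The variable names are chosen here for illustration only.

```r
# Values copied from the summary(gam1) output above
null_dev  <- 5222086
null_df   <- 2999
resid_dev <- 3685543
resid_df  <- 2983

# Dispersion for the gaussian family: residual deviance per residual df
dispersion <- resid_dev / resid_df

# Degrees of freedom used by the terms: 6 (age) + 6 (year) + 4 (education)
model_df <- null_df - resid_df

# Proportion of the null deviance explained by the model
dev_explained <- 1 - resid_dev / null_dev

print(round(dispersion, 3))
print(model_df)
print(round(dev_explained, 3))
```

The recomputed dispersion agrees with the value reported in the summary, and the model explains a little under 30% of the null deviance.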

#Plotting the Model
par(mfrow=c(1,3))
plot(gam1,se = TRUE)

In the plots above, the y-axis shows the fitted function for each term and the x-axis shows the corresponding predictor. The dashed lines are standard-error bands. The model as a whole is additive: the fitted value is the sum of these per-predictor functions.

\[\textbf{The curvature in the plots shows that the fitted functions are non-linear.}\]
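
The additive structure can be made explicit in code. The following is a minimal sketch on simulated data, using base R's lm() with a natural-spline term as a stand-in so that it runs without the gam package. The data and the variable names are assumptions made for illustration and are not part of the Wage analysis. It shows that the fitted values are exactly the sum of the per-term contributions plus a constant.

```r
library(splines)  # ships with R; provides ns()

# Simulated data (hypothetical, for illustration only)
set.seed(1)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- sin(x1) + 0.5 * x2 + rnorm(n, sd = 0.3)

# Additive fit: a non-linear term in x1 plus a linear term in x2
fit <- lm(y ~ ns(x1, df = 4) + x2)

# One column per model term, each centred, plus a "constant" attribute
terms_mat <- predict(fit, type = "terms")

# Adding the term contributions back together recovers the fitted values
rebuilt <- rowSums(terms_mat) + attr(terms_mat, "constant")
print(all.equal(unname(rebuilt), unname(fitted(fit))))
```

Each column of terms_mat corresponds to one panel of a plot like the ones above, which is why every predictor's effect can be read off separately.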

We can also fit a logistic regression model with gam() by setting family = binomial.

# Logistic regression model: the response is TRUE when wage exceeds 250
gam2 <- gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage, family = binomial)

plot(gam2, se = TRUE)

Each panel shows one predictor's contribution to the logit of the probability that wage exceeds 250, drawn as a separate function. The model as a whole is additive on the logit scale.
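
Values on the logit scale are easier to read once they are converted back to probabilities. As a small sketch, base R's plogis() is the inverse logit and qlogis() is the logit. The input values below are hypothetical and are not taken from the fitted model.

```r
# Hypothetical values of an additive predictor on the logit scale
logit_values <- c(-4, -2, 0, 2)

# plogis() applies the inverse logit: p = 1 / (1 + exp(-logit))
probabilities <- plogis(logit_values)

# qlogis() maps the probabilities back to the logit scale
round_trip <- qlogis(probabilities)

print(round(probabilities, 3))
```

A logit of 0 corresponds to a probability of 0.5, and strongly negative logits correspond to probabilities close to 0.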

Next, we can check whether a non-linear term for the year variable is needed at all.

# Fit the additive logistic model that is linear in year
gam3 <- gam(I(wage > 250) ~ s(age, df = 4) + year + education, data = Wage, family = binomial)
plot(gam3)

# Use anova() to compare the two nested models
# test = "Chisq" runs a chi-squared test on the change in deviance,
# which is the appropriate comparison for binomial models
anova(gam2, gam3, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education
## Model 2: I(wage > 250) ~ s(age, df = 4) + year + education
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      2987     602.87                     
## 2      2990     603.78 -3 -0.90498   0.8242

\[\text{The plot for year is a straight line, i.e. the fitted function is linear in year.}\]

The test above (p = 0.8242) indicates that the model with a non-linear term for year does not fit significantly better than the model that is linear in year, so we can prefer the simpler model.
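
The p-value in the deviance table can be reproduced by hand. The sketch below takes the rounded change in deviance (0.90498) and the difference in degrees of freedom (3) from the table above and looks up the upper tail of the chi-squared distribution. The variable names are chosen here for illustration.

```r
# Values copied from the analysis of deviance table above
deviance_change <- 0.90498  # drop in deviance from the non-linear year term
df_change       <- 3        # extra degrees of freedom it uses

# Upper-tail probability of a chi-squared distribution with 3 df
p_value <- pchisq(deviance_change, df = df_change, lower.tail = FALSE)

print(round(p_value, 4))
```

A p-value this large means the extra flexibility in year buys almost no improvement in fit.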

We can also fit an additive model with the lm() function by using natural splines for the non-linear terms.

# ns() builds a natural-spline basis (from the splines package, which gam loads)
lm1 <- lm(wage ~ ns(age, df = 4) + ns(year, df = 4) + education, data = Wage)
lm1

## 
## Call:
## lm(formula = wage ~ ns(age, df = 4) + ns(year, df = 4) + education, 
##     data = Wage)
## 
## Coefficients:
##                 (Intercept)             ns(age, df = 4)1  
##                      43.976                       46.541  
##            ns(age, df = 4)2             ns(age, df = 4)3  
##                      29.070                       63.853  
##            ns(age, df = 4)4            ns(year, df = 4)1  
##                      10.881                        8.417  
##           ns(year, df = 4)2            ns(year, df = 4)3  
##                       3.596                        8.000  
##           ns(year, df = 4)4          education2. HS Grad  
##                       6.701                       10.870  
##    education3. Some College     education4. College Grad  
##                      23.354                       38.112  
## education5. Advanced Degree  
##                      62.517

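# Now plot the model
# (in recent versions of the gam package this function is named plot.Gam())
plot.gam(lm1, se = TRUE)

# The fitted curves look much like those from the gam() fit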

So we can also fit an additive model with the lm() function, because a natural spline is just a set of basis functions that enter the model linearly.
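
To see why ordinary least squares is enough here, it helps to look at what ns() returns. The following is a minimal sketch on simulated data; the data and variable names are assumptions for illustration and are not the Wage data. ns() produces a plain numeric matrix with one column per degree of freedom, and lm() simply regresses the response on those columns.

```r
library(splines)  # ships with R; provides ns()

# Simulated data (hypothetical, for illustration only)
set.seed(42)
x <- runif(150, 18, 80)
y <- 50 + 20 * sin(x / 15) + rnorm(150, sd = 3)

# ns() returns a basis matrix: one row per observation, df columns
basis <- ns(x, df = 4)
print(dim(basis))

# Fitting on the precomputed basis matrix ...
fit_basis <- lm(y ~ basis)

# ... gives the same fitted values as putting ns() in the formula
fit_formula <- lm(y ~ ns(x, df = 4))
print(all.equal(unname(fitted(fit_basis)), unname(fitted(fit_formula))))
```

Smoothing splines, as used by s() inside gam(), cannot be written as a fixed basis in this way, which is why they need the gam() fitting routine rather than lm().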

Conclusion

GAMs are a useful technique for modelling non-linearities and for learning functions more complex than a straight line. Because the model is additive, the effect of each predictor can be examined on its own, which keeps GAMs easy to interpret.

The basic idea behind learning non-linearities is to transform the predictors so that the model can capture relationships more complicated than a linear one.

\[\text {Because the truth is not always "Linear"}\]