Introduction

The generalized linear model (GLM) function in R is an important tool for analyzing and predicting data. With a single function, glm(), we can build many different kinds of models.


Content Overview

We will introduce the glm() function in R and show how to use it through several examples. The dataset comes from the ISLR package.
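
If you want to follow along, the sketch below is a minimal setup: it loads the ISLR package (install it first if needed) and takes a quick look at the Default data used throughout this code through.

# install.packages("ISLR")   # run once if the package is not installed
library(ISLR)      # provides the Default data set

str(Default)       # 10,000 observations: default, student, balance, income
summary(Default)   # quick summary of each variable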


Why You Should Care

Unlike simple linear regression, a generalized linear model does not require the dependent variable to have normally distributed errors. Researchers therefore have more options when they use the glm() function: by choosing different error distributions (families) and link functions, they can fit many different types of models.
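
The sketch below uses small simulated variables (not part of the Default analysis) to illustrate how changing the family argument turns the same glm() call into ordinary linear regression, logistic regression, or Poisson regression.

# Simulated illustration: same glm() call, different families and links.
set.seed(123)
x <- rnorm(100)
y.normal <- 1 + 2 * x + rnorm(100)              # continuous response
y.binary <- rbinom(100, 1, plogis(1 + 2 * x))   # binary (0/1) response
y.count  <- rpois(100, exp(0.5 + 0.3 * x))      # count response

glm(y.normal ~ x, family = gaussian(link = "identity"))  # ordinary linear regression
glm(y.binary ~ x, family = binomial(link = "logit"))     # logistic regression
glm(y.count ~ x, family = poisson(link = "log"))         # Poisson regression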


Learning Objectives

You will learn how to build different models with the glm() function by changing its arguments, and how to compare the resulting models. After building the models, we will make predictions and check their accuracy.



GLM Function

We will use the Default dataset from the ISLR package to show different results, and we will compare the differences between the models.


Further Exposition

We will walk through the process of building models with the glm() function and use confusion matrices to compare the results of the different models.
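
As a quick refresher, a confusion matrix is simply a cross-tabulation of predicted class labels against actual class labels. The tiny sketch below uses made-up vectors just to show the idea; later we use caret::confusionMatrix(), which also reports accuracy, sensitivity, and specificity.

# Hypothetical mini-example of a confusion matrix
actual    <- factor(c("No", "No", "Yes", "No", "Yes", "No"))
predicted <- factor(c("No", "No", "No",  "No", "Yes", "Yes"))
table(Predicted = predicted, Actual = actual)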


Basic Example

We can use the glm() function to build different models. Here, we fit a logit model and a probit model as examples.

# Fit a logistic regression (logit link) of default on all other variables.
logit.lm = glm(default ~ ., data = Default, family = binomial(link = "logit"))
summary(logit.lm)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
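
Before fitting the probit model, a brief aside that is not part of the original output: because the logit link models the log-odds of default, one common way to read the coefficients is to exponentiate them into odds ratios.

# Read logit coefficients as odds ratios: exp(0.005737) is about 1.006,
# so each extra dollar of balance multiplies the odds of default by ~1.006.
exp(coef(logit.lm))
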
# Fit the same model with a probit link for comparison.
probit.lm = glm(default ~ ., data = Default, family = binomial(link = "probit"))
summary(probit.lm)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"), 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2226  -0.1354  -0.0321  -0.0044   4.1254  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.475e+00  2.385e-01 -22.960   <2e-16 ***
## studentYes  -2.960e-01  1.188e-01  -2.491   0.0127 *  
## balance      2.821e-03  1.139e-04  24.774   <2e-16 ***
## income       2.101e-06  4.121e-06   0.510   0.6101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1583.2  on 9996  degrees of freedom
## AIC: 1591.2
## 
## Number of Fisher Scoring iterations: 8
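
To put the two fits side by side, we could also compare their AIC values directly (a small sketch; the values already appear in the summaries above, where the logit model has the slightly lower AIC).

# Lower AIC suggests a better balance of fit and complexity;
# here the logit model (AIC 1579.5) edges out the probit model (AIC 1591.2).
AIC(logit.lm, probit.lm)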


Advanced Examples

In the following, we compare different models built with the glm() function. First, we split the data into a training set and a validation set. Then we fit the logit and probit models on the training data.

set.seed(1)
# Randomly sample 50% of the rows for training; the rest form the validation set.
train.index = sample(c(1:dim(Default)[1]),
                     dim(Default)[1] * 0.5)
train.df = Default[train.index, ]
valid.df = Default[-train.index, ]
# Refit the logit model on the training data only.
logit.lm.train = glm(default ~ ., data = train.df, family = binomial(link = "logit"))
summary(logit.lm.train)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"), 
##     data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5823  -0.1419  -0.0554  -0.0210   3.3961  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.134e+01  6.937e-01 -16.346   <2e-16 ***
## studentYes  -5.992e-01  3.324e-01  -1.803   0.0715 .  
## balance      5.767e-03  3.213e-04  17.947   <2e-16 ***
## income       1.686e-05  1.122e-05   1.502   0.1331    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.77  on 4999  degrees of freedom
## Residual deviance:  800.07  on 4996  degrees of freedom
## AIC: 808.07
## 
## Number of Fisher Scoring iterations: 8
# Refit the probit model on the same training data.
probit.lm.train = glm(default ~ ., data = train.df, family = binomial(link = "probit"))
summary(probit.lm.train)
## 
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"), 
##     data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3979  -0.1294  -0.0288  -0.0039   3.6174  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.813e+00  3.417e-01 -17.015   <2e-16 ***
## studentYes  -2.871e-01  1.693e-01  -1.696   0.0899 .  
## balance      2.905e-03  1.609e-04  18.054   <2e-16 ***
## income       9.001e-06  5.706e-06   1.577   0.1147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.77  on 4999  degrees of freedom
## Residual deviance:  799.55  on 4996  degrees of freedom
## AIC: 807.55
## 
## Number of Fisher Scoring iterations: 8


We calculate predictions on the validation data with each of the two models.

# Compare the deviances of the two fitted training models.
anova(logit.lm.train, probit.lm.train, test = "Chisq")
library(forecast)
# Predicted probabilities of default on the validation set.
logit.pred = predict(logit.lm.train, valid.df, type = "response")
probit.pred = predict(probit.lm.train, valid.df, type = "response")

dat = data.frame(actual = valid.df$default,
                 predicted = logit.pred)
head(dat)
dat2 = data.frame(actual = valid.df$default,
                  predicted = probit.pred)
head(dat2)


We use confusion matrices to compare the results. The accuracy values of the two models are the same. However, the specificity of both models is low because the values of default are unbalanced: the percentage of “No” is much higher than “Yes”. We could use resampling or other techniques to deal with this imbalance, which we may introduce in the future (see the sketch after the confusion matrices below).

library(caret)
# Classify as "Yes" when the predicted probability of default exceeds 0.5.
logitPre = ifelse(dat$predicted > 0.5, "Yes", "No")
datPre = as.factor(logitPre)
confusionMatrix(datPre, valid.df$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  4825  112
##        Yes   18   45
##                                           
##                Accuracy : 0.974           
##                  95% CI : (0.9692, 0.9782)
##     No Information Rate : 0.9686          
##     P-Value [Acc > NIR] : 0.01394         
##                                           
##                   Kappa : 0.3983          
##                                           
##  Mcnemar's Test P-Value : 3.445e-16       
##                                           
##             Sensitivity : 0.9963          
##             Specificity : 0.2866          
##          Pos Pred Value : 0.9773          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.9686          
##          Detection Rate : 0.9650          
##    Detection Prevalence : 0.9874          
##       Balanced Accuracy : 0.6415          
##                                           
##        'Positive' Class : No              
## 
# Apply the same 0.5 cutoff to the probit predictions.
probitPre = ifelse(dat2$predicted > 0.5, "Yes", "No")
dat2Pre = as.factor(probitPre)
confusionMatrix(dat2Pre, valid.df$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  4828  115
##        Yes   15   42
##                                           
##                Accuracy : 0.974           
##                  95% CI : (0.9692, 0.9782)
##     No Information Rate : 0.9686          
##     P-Value [Acc > NIR] : 0.01394         
##                                           
##                   Kappa : 0.3822          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.9969          
##             Specificity : 0.2675          
##          Pos Pred Value : 0.9767          
##          Neg Pred Value : 0.7368          
##              Prevalence : 0.9686          
##          Detection Rate : 0.9656          
##    Detection Prevalence : 0.9886          
##       Balanced Accuracy : 0.6322          
##                                           
##        'Positive' Class : No              
## 
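
As noted above, the Default data are heavily imbalanced, which is what drives the low specificity. The sketch below (not part of the original analysis) shows one way to check the imbalance and one possible remedy: downsampling the majority class in the training data with caret::downSample() before refitting. Upsampling or lowering the classification threshold are other options.

# Check the class imbalance in the response: roughly 97% "No" vs 3% "Yes".
prop.table(table(Default$default))

# Possible remedy (a sketch): downsample the "No" class so both classes are
# equally represented in the training data, then refit the logit model.
balanced.df = downSample(x = train.df[, c("student", "balance", "income")],
                         y = train.df$default, yname = "default")
logit.lm.bal = glm(default ~ ., data = balanced.df,
                   family = binomial(link = "logit"))
summary(logit.lm.bal)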



Further Resources

Learn more about the glm() function, the ISLR package, and the Default dataset with the following:




Works Cited

This code through references and cites the following sources: