Generalized linear models (GLMs) are an important tool for analyzing and predicting data, and the glm() function in R lets us build many different models within a single framework.
In this code through, we introduce the glm() function in R and show how to use it with several examples. The dataset comes from the ISLR package.
Unlike a simple linear regression model, a generalized linear model does not require the response variable to have normally distributed errors. This gives researchers more options: by choosing a different error distribution and link function, they can fit many different types of models with the same function.
We can learn how to build different models by changing the arguments of glm() and then compare them. After building the models, we will make predictions and check their accuracy.
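To illustrate, the family argument is what selects the error distribution and link function. Below is a minimal sketch of this idea; the data frame my.data and the variables y, counts, and x are hypothetical placeholders, not part of the Default data used later.
# The family argument selects the error distribution and link function.
# my.data, y, counts, and x are hypothetical placeholders for illustration.
gaussian.fit = glm(y ~ x, data = my.data, family = gaussian)                   # ordinary linear regression
logit.fit = glm(y ~ x, data = my.data, family = binomial(link = "logit"))      # binary response
poisson.fit = glm(counts ~ x, data = my.data, family = poisson(link = "log"))  # count response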
We will use the Default data set from the ISLR package to show the results and compare the differences between models.
We will walk through the process of building models with the glm() function and use a confusion matrix to compare the results of the different models.
The glm() function can build many different models; here we show a logit model and a probit model as examples.
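The first summary output below comes from a logit model fit to the full Default data. A sketch of the code that produces it is shown here; the call itself is taken from the Call line in the output, while the object name logit.lm is an assumed placeholder.
library(ISLR)   # provides the Default data set
# Logit model: default regressed on all other predictors
logit.lm = glm(default ~ ., data = Default, family = binomial(link = "logit"))
summary(logit.lm)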
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4691 -0.1418 -0.0557 -0.0203 3.7383
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
## studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
## balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
## income 3.033e-06 8.203e-06 0.370 0.71152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.5 on 9996 degrees of freedom
## AIC: 1579.5
##
## Number of Fisher Scoring iterations: 8
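The second summary comes from the matching probit model. Again, the call is taken from the Call line in the output and the object name probit.lm is an assumed placeholder.
# Probit model: same predictors, probit link
probit.lm = glm(default ~ ., data = Default, family = binomial(link = "probit"))
summary(probit.lm)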
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"),
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2226 -0.1354 -0.0321 -0.0044 4.1254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.475e+00 2.385e-01 -22.960 <2e-16 ***
## studentYes -2.960e-01 1.188e-01 -2.491 0.0127 *
## balance 2.821e-03 1.139e-04 24.774 <2e-16 ***
## income 2.101e-06 4.121e-06 0.510 0.6101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1583.2 on 9996 degrees of freedom
## AIC: 1591.2
##
## Number of Fisher Scoring iterations: 8
In the following, we compare models built with the glm() function. First, we split the data into a training set and a validation set. Then we fit the logit and probit models on the training data.
set.seed(1)
# Randomly select half of the observations for training; the rest form the validation set
train.index = sample(c(1:dim(Default)[1]), dim(Default)[1]*0.5)
train.df = Default[train.index, ]
valid.df = Default[-train.index, ]
# Fit the logit model on the training data
logit.lm.train = glm(default ~ ., data = train.df, family = binomial(link = "logit"))
summary(logit.lm.train)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "logit"),
## data = train.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5823 -0.1419 -0.0554 -0.0210 3.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.134e+01 6.937e-01 -16.346 <2e-16 ***
## studentYes -5.992e-01 3.324e-01 -1.803 0.0715 .
## balance 5.767e-03 3.213e-04 17.947 <2e-16 ***
## income 1.686e-05 1.122e-05 1.502 0.1331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.77 on 4999 degrees of freedom
## Residual deviance: 800.07 on 4996 degrees of freedom
## AIC: 808.07
##
## Number of Fisher Scoring iterations: 8
# Fit the probit model on the same training data
probit.lm.train = glm(default ~ ., data = train.df, family = binomial(link = "probit"))
summary(probit.lm.train)
##
## Call:
## glm(formula = default ~ ., family = binomial(link = "probit"),
## data = train.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3979 -0.1294 -0.0288 -0.0039 3.6174
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.813e+00 3.417e-01 -17.015 <2e-16 ***
## studentYes -2.871e-01 1.693e-01 -1.696 0.0899 .
## balance 2.905e-03 1.609e-04 18.054 <2e-16 ***
## income 9.001e-06 5.706e-06 1.577 0.1147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.77 on 4999 degrees of freedom
## Residual deviance: 799.55 on 4996 degrees of freedom
## AIC: 807.55
##
## Number of Fisher Scoring iterations: 8
Next, we use both models to compute predicted probabilities on the validation data.
library(forecast)
# Predicted default probabilities on the validation data
logit.pred = predict(logit.lm.train, valid.df, type = "response")
probit.pred = predict(probit.lm.train, valid.df, type = "response")
# Collect the actual values and the predictions from each model
dat = data.frame(actual = valid.df$default, predicted = logit.pred)
dat2 = data.frame(actual = valid.df$default, predicted = probit.pred)
head(dat)
We use a confusion matrix to compare the results. The accuracy values of the two models are the same. However, the specificity of both models is low because the default variable is unbalanced: the percentage of "No" is much higher than "Yes". Sampling or other techniques could be used to deal with this problem, which we may introduce in the future.
library(caret)
# Convert the logit probabilities to class labels with a 0.5 cutoff
logitPre = ifelse(dat$predicted > 0.5, "Yes", "No")
datPre = as.factor(logitPre)
confusionMatrix(datPre, valid.df$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4825 112
## Yes 18 45
##
## Accuracy : 0.974
## 95% CI : (0.9692, 0.9782)
## No Information Rate : 0.9686
## P-Value [Acc > NIR] : 0.01394
##
## Kappa : 0.3983
##
## Mcnemar's Test P-Value : 3.445e-16
##
## Sensitivity : 0.9963
## Specificity : 0.2866
## Pos Pred Value : 0.9773
## Neg Pred Value : 0.7143
## Prevalence : 0.9686
## Detection Rate : 0.9650
## Detection Prevalence : 0.9874
## Balanced Accuracy : 0.6415
##
## 'Positive' Class : No
##
# Repeat the same steps for the probit model
probitPre = ifelse(dat2$predicted > 0.5, "Yes", "No")
dat2Pre = as.factor(probitPre)
confusionMatrix(dat2Pre, valid.df$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4828 115
## Yes 15 42
##
## Accuracy : 0.974
## 95% CI : (0.9692, 0.9782)
## No Information Rate : 0.9686
## P-Value [Acc > NIR] : 0.01394
##
## Kappa : 0.3822
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.9969
## Specificity : 0.2675
## Pos Pred Value : 0.9767
## Neg Pred Value : 0.7368
## Prevalence : 0.9686
## Detection Rate : 0.9656
## Detection Prevalence : 0.9886
## Balanced Accuracy : 0.6322
##
## 'Positive' Class : No
##
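As noted above, the low specificity comes from the imbalance in default. One simple option, sketched below rather than run as part of the analysis above, is to lower the classification cutoff below 0.5 so that more observations are labeled "Yes"; this usually raises specificity at the cost of some sensitivity. The 0.2 cutoff is an arbitrary illustration, not a tuned value.
# Sketch: lower the cutoff from 0.5 to 0.2 to label more observations as "Yes"
# (0.2 is an arbitrary illustrative value, not a tuned choice)
logitPre.02 = factor(ifelse(dat$predicted > 0.2, "Yes", "No"), levels = levels(valid.df$default))
confusionMatrix(logitPre.02, valid.df$default)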
Learn more about the glm() function and the ISLR package with the following resources:
Resource I: glm function
Resource II: Package ISLR
This code through references and cites the following sources:
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). Source I. An Introduction to Statistical Learning: with Applications in R. Springer.
Wikipedia (2020). Source II. Generalized linear model.
R Stats Package (2020). Source III. Fitting Generalized Linear Models.