This chapter introduces models for categorical and other discrete response variables. GLMs for categorical responses include logistic regression model
##If the outcome is a count (for discrete response variables), then use loginear model# ##If the outcome is a binary count (for discrete response variables), then use logistic regression model# or linear probability model, it mimics ordinary linear modeling and use identity link function P(Y=1)=
introduces inference,methods, model checking, and model fitting for GLMs
================= Random component=Y (identifies response variable Y and its probability distribution). Binomial,Poission or nonnegative binomial, etc Linear predictor= specifies the explanatory variables through a predictin equation that has a linear form,
Link function= secifies afunction E(Y) that the GLM relates to the linear predictor, log link functions is appropiate when mucan’t take negative values such as with count data g(mu)=log[mu/(1-mu)]
Binomial, linear probability model interpretation For example, the parameter β1 represents the change in P(Y = 1) per unit change in x1, adjusting for the other variables. This model is a GLM with a binomial random component and identity link function
Heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat", header=TRUE)
Heart #in contingency table form
## snoring yes no
## 1 never 24 1355
## 2 occasional 35 603
## 3 nearly_every_night 21 192
## 4 every_night 30 224
library(dplyr) # to recode explanatory variable
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Heart$x <- recode(Heart$snoring, never = 0, occasional = 2, nearly_every_night = 4, every_night = 5)
n <- Heart$yes + Heart$no # binomial sample sizes are the row totals
fit <- glm(yes/n ~ x, family=binomial(link=logit), weights=n, data=Heart)
# canonical link for binomial is logit, so "(link=logit)" not necessary
# "weights" indicates sample proportion yes/n is based on n observations
summary(fit)
##
## Call:
## glm(formula = yes/n ~ x, family = binomial(link = logit), data = Heart,
## weights = n)
##
## Deviance Residuals:
## 1 2 3 4
## -0.8346 1.2521 0.2758 -0.6845
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.86625 0.16621 -23.261 < 2e-16 ***
## x 0.39734 0.05001 7.945 1.94e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65.9045 on 3 degrees of freedom
## Residual deviance: 2.8089 on 2 degrees of freedom
## AIC: 27.061
##
## Number of Fisher Scoring iterations: 4
fitted(fit)
## 1 2 3 4
## 0.02050742 0.04429511 0.09305411 0.13243885
Identity link function
fit2 <- glm(yes/n ~ x, family=quasi(link=identity, variance="mu(1-mu)"),weights=n, data=Heart)
summary(fit2, dispersion=1)
##
## Call:
## glm(formula = yes/n ~ x, family = quasi(link = identity, variance = "mu(1-mu)"),
## data = Heart, weights = n)
##
## Deviance Residuals:
## 1 2 3 4
## 0.04478 -0.21322 0.11010 0.09798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.017247 0.003451 4.998 5.80e-07 ***
## x 0.019778 0.002805 7.051 1.77e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasi family taken to be 1)
##
## Null deviance: 65.904481 on 3 degrees of freedom
## Residual deviance: 0.069191 on 2 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
#which yields the resultˆ P(Y = 1) = 0.017 + 0.020x
Ungroupped binary data A standard data file has a row for each subject (e.g., person) and a column for each variable. For a binary response variable, the data file has a column of 1 (success) and 0 (failure) values.
Heart2 <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart2.dat", header=TRUE)
head(Heart2,10)
## subject snoring x y
## 1 1 never 0 1
## 2 2 never 0 1
## 3 3 never 0 1
## 4 4 never 0 1
## 5 5 never 0 1
## 6 6 never 0 1
## 7 7 never 0 1
## 8 8 never 0 1
## 9 9 never 0 1
## 10 10 never 0 1
3.3.2 Poisson Loglinear Model
The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s carapace (shell) width, which is a summary of her size.
Crabs <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", header=TRUE)
Crabs
## crab sat y weight width color spine
## 1 1 8 1 3.050 28.3 2 3
## 2 2 0 0 1.550 22.5 3 3
## 3 3 9 1 2.300 26.0 1 1
## 4 4 0 0 2.100 24.8 3 3
## 5 5 4 1 2.600 26.0 3 3
## 6 6 0 0 2.100 23.8 2 3
## 7 7 0 0 2.350 26.5 1 1
## 8 8 0 0 1.900 24.7 3 2
## 9 9 0 0 1.950 23.7 2 1
## 10 10 0 0 2.150 25.6 3 3
## 11 11 0 0 2.150 24.3 3 3
## 12 12 0 0 2.650 25.8 2 3
## 13 13 11 1 3.050 28.2 2 3
## 14 14 0 0 1.850 21.0 4 2
## 15 15 14 1 2.300 26.0 2 1
## 16 16 8 1 2.950 27.1 1 1
## 17 17 1 1 2.000 25.2 2 3
## 18 18 1 1 3.000 29.0 2 3
## 19 19 0 0 2.200 24.7 4 3
## 20 20 5 1 2.700 27.4 2 3
## 21 21 4 1 1.950 23.2 2 2
## 22 22 3 1 2.300 25.0 1 2
## 23 23 1 1 1.600 22.5 2 1
## 24 24 2 1 2.600 26.7 3 3
## 25 25 3 1 2.000 25.8 4 3
## 26 26 0 0 1.300 26.2 4 3
## 27 27 3 1 3.150 28.7 2 3
## 28 28 5 1 2.700 26.8 2 1
## 29 29 0 0 2.600 27.5 4 3
## 30 30 0 0 2.100 24.9 2 3
## 31 31 4 1 3.200 29.3 1 1
## 32 32 0 0 2.600 25.8 1 3
## 33 33 0 0 2.000 25.7 2 2
## 34 34 8 1 2.000 25.7 2 1
## 35 35 5 1 2.700 26.7 2 1
## 36 36 0 0 1.850 23.7 4 3
## 37 37 0 0 2.650 26.8 2 3
## 38 38 6 1 3.150 27.5 2 3
## 39 39 0 0 1.900 23.4 4 3
## 40 40 6 1 2.800 27.9 2 3
## 41 41 3 1 3.100 27.5 3 3
## 42 42 5 1 2.800 26.1 1 1
## 43 43 6 1 2.500 27.7 1 1
## 44 44 5 1 3.300 30.0 2 1
## 45 45 9 1 3.250 28.5 3 1
## 46 46 4 1 2.800 28.9 3 3
## 47 47 6 1 2.600 28.2 2 3
## 48 48 4 1 2.100 25.0 2 3
## 49 49 3 1 3.000 28.5 2 3
## 50 50 3 1 3.600 30.3 2 1
## 51 51 5 1 2.100 24.7 4 3
## 52 52 5 1 2.900 27.7 2 3
## 53 53 6 1 2.700 27.4 1 1
## 54 54 4 1 1.600 22.9 2 3
## 55 55 5 1 2.000 25.7 2 1
## 56 56 15 1 3.000 28.3 2 3
## 57 57 3 1 2.700 27.2 2 3
## 58 58 3 1 2.300 26.2 3 3
## 59 59 0 0 2.750 27.8 2 1
## 60 60 0 0 2.250 25.5 4 3
## 61 61 0 0 2.550 27.1 3 3
## 62 62 5 1 2.050 24.5 3 3
## 63 63 3 1 2.450 27.0 3 1
## 64 64 5 1 2.150 26.0 2 3
## 65 65 1 1 2.800 28.0 2 3
## 66 66 8 1 3.050 30.0 2 3
## 67 67 10 1 3.200 29.0 2 3
## 68 68 0 0 2.400 26.2 2 3
## 69 69 0 0 1.300 26.5 2 1
## 70 70 3 1 2.400 26.2 2 3
## 71 71 7 1 2.800 25.6 3 3
## 72 72 1 1 1.650 23.0 3 3
## 73 73 0 0 1.800 23.0 3 3
## 74 74 6 1 2.250 25.4 2 3
## 75 75 0 0 1.900 24.2 3 3
## 76 76 0 0 1.600 22.9 2 2
## 77 77 3 1 2.200 26.0 3 2
## 78 78 4 1 2.250 25.4 2 3
## 79 79 0 0 1.200 25.7 3 3
## 80 80 5 1 2.100 25.1 2 3
## 81 81 0 0 2.250 24.5 3 2
## 82 82 0 0 2.900 27.5 4 3
## 83 83 0 0 1.650 23.1 3 3
## 84 84 4 1 2.550 25.9 3 1
## 85 85 0 0 2.300 25.8 2 3
## 86 86 3 1 2.250 27.0 4 3
## 87 87 0 0 3.050 28.5 2 3
## 88 88 0 0 2.750 25.5 4 1
## 89 89 0 0 1.900 23.5 4 3
## 90 90 0 0 1.700 24.0 2 2
## 91 91 5 1 3.850 29.7 2 1
## 92 92 0 0 2.550 26.8 2 1
## 93 93 0 0 2.450 26.7 4 3
## 94 94 0 0 3.200 28.7 2 1
## 95 95 0 0 1.550 23.1 3 3
## 96 96 1 1 2.800 29.0 2 1
## 97 97 0 0 2.250 25.5 3 3
## 98 98 1 1 1.967 26.5 3 3
## 99 99 1 1 2.200 24.5 3 3
## 100 100 1 1 3.000 28.5 3 3
## 101 101 1 1 2.867 28.2 2 3
## 102 102 1 1 1.600 24.5 2 3
## 103 103 1 1 2.550 27.5 2 3
## 104 104 4 1 2.550 24.7 2 2
## 105 105 1 1 2.000 25.2 2 1
## 106 106 1 1 2.900 27.3 3 3
## 107 107 1 1 2.400 26.3 2 3
## 108 108 1 1 3.100 29.0 2 3
## 109 109 2 1 1.900 25.3 2 3
## 110 110 4 1 2.300 26.5 2 3
## 111 111 3 1 3.250 27.8 2 3
## 112 112 6 1 2.500 27.0 2 3
## 113 113 0 0 2.100 25.7 3 3
## 114 114 2 1 2.100 25.0 2 3
## 115 115 2 1 3.325 31.9 2 3
## 116 116 0 0 1.800 23.7 4 3
## 117 117 12 1 3.225 29.3 4 3
## 118 118 0 0 1.400 22.0 3 3
## 119 119 5 1 2.400 25.0 2 3
## 120 120 6 1 2.500 27.0 3 3
## 121 121 6 1 1.800 23.8 3 3
## 122 122 2 1 3.275 30.2 1 1
## 123 123 0 0 2.225 26.2 3 3
## 124 124 2 1 1.650 24.2 2 3
## 125 125 3 1 2.900 27.4 2 3
## 126 126 0 0 2.300 25.4 2 2
## 127 127 3 1 3.200 28.4 3 3
## 128 128 4 1 1.475 22.5 4 3
## 129 129 2 1 2.025 26.2 2 3
## 130 130 6 1 2.300 24.9 2 1
## 131 131 6 1 1.950 24.5 1 2
## 132 132 0 0 1.800 25.1 2 3
## 133 133 4 1 2.900 28.0 2 1
## 134 134 10 1 2.250 25.8 4 3
## 135 135 7 1 3.050 27.9 2 3
## 136 136 0 0 2.200 24.9 2 3
## 137 137 5 1 3.100 28.4 2 1
## 138 138 5 1 2.400 27.2 3 3
## 139 139 6 1 2.250 25.0 2 2
## 140 140 6 1 2.625 27.5 2 3
## 141 141 7 1 5.200 33.5 2 1
## 142 142 3 1 3.325 30.5 2 3
## 143 143 3 1 2.925 29.0 3 3
## 144 144 0 0 2.000 24.3 2 1
## 145 145 0 0 2.400 25.8 2 3
## 146 146 8 1 2.100 25.0 4 3
## 147 147 4 1 3.725 31.7 2 1
## 148 148 4 1 3.025 29.5 2 3
## 149 149 10 1 1.900 24.0 3 3
## 150 150 9 1 3.000 30.0 2 3
## 151 151 4 1 2.850 27.6 2 3
## 152 152 0 0 2.300 26.2 2 3
## 153 153 0 0 2.000 23.1 2 1
## 154 154 0 0 1.600 22.9 2 1
## 155 155 0 0 1.900 24.5 4 3
## 156 156 4 1 1.950 24.7 2 3
## 157 157 0 0 3.200 28.3 2 3
## 158 158 2 1 1.850 23.9 2 3
## 159 159 0 0 1.800 23.8 3 3
## 160 160 4 1 3.500 29.8 3 2
## 161 161 4 1 2.350 26.5 2 3
## 162 162 3 1 2.275 26.0 2 3
## 163 163 8 1 3.050 28.2 2 3
## 164 164 0 0 2.150 25.7 4 3
## 165 165 7 1 2.750 26.5 2 3
## 166 166 0 0 2.200 25.8 2 3
## 167 167 0 0 1.800 24.1 3 3
## 168 168 2 1 2.175 26.2 3 3
## 169 169 3 1 2.750 26.1 3 3
## 170 170 4 1 3.275 29.0 3 3
## 171 171 0 0 2.625 28.0 1 1
## 172 172 0 0 2.625 27.0 4 3
## 173 173 0 0 2.000 24.5 2 2
plot(sat ~ width, xlab="Width", ylab="Number of satellites", data=Crabs)
fit <- glm(sat ~ width, family=poisson(link=log), data=Crabs) # canonical link for Poisson is log, so "(link=log)" is not necessary
summary(fit)
##
## Call:
## glm(formula = sat ~ width, family = poisson(link = log), data = Crabs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
#Great variability GAM helps to discern trend
library(gam) # generalized additive model smoothing fit
## Warning: package 'gam' was built under R version 4.1.3
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20.1
gam.fit <- gam(sat ~ s(width), family=poisson, data=Crabs)
summary(gam.fit) #foes not say much
##
## Call: gam(formula = sat ~ s(width), family = poisson, data = Crabs)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9444 -1.9749 -0.5479 0.9783 4.7707
##
## (Dispersion Parameter for poisson family taken to be 1)
##
## Null Deviance: 632.7917 on 172 degrees of freedom
## Residual Deviance: 555.1206 on 168 degrees of freedom
## AIC: 920.4184
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(width) 1 55.55 55.545 17.473 4.679e-05 ***
## Residuals 168 534.06 3.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar Chisq P(Chi)
## (Intercept)
## s(width) 3 11.719 0.00841 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# s() is smooth function predictor for generalized additive model
curve(predict(gam.fit, data.frame(width=x), type="resp"), add=TRUE)
#exp(0.164) = 1.178 represents the multiplicative effect on the fitted value for each 1-cm increase in x.
#For instance, the fitted value at x = 27.3 = 26.3 + 1 is exp[−3.305 + 0.164(27.3)] = 3.23, which equals 1.178(2.74).
STATISTICAL INFERENCE AND MODEL CHECKING Wald, Likelihood-Ratio and Score Inference Use the Likelihood Function
Evo <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Evolution.dat", header=TRUE)
Evo
## ideology true false
## 1 1 11 37
## 2 2 46 104
## 3 3 70 72
## 4 4 241 214
## 5 5 78 36
## 6 6 89 24
## 7 7 36 6
n <- Evo$true + Evo$false # binomial sample sizes
fit <- glm(true/n ~ ideology, family=binomial, weights=n, data=Evo)
summary(fit) # logistic regression fit
##
## Call:
## glm(formula = true/n ~ ideology, family = binomial, data = Evo,
## weights = n)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7
## 0.1430 -0.2697 1.4614 -1.0791 0.2922 0.4471 0.2035
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.75658 0.20500 -8.569 <2e-16 ***
## ideology 0.49422 0.05092 9.706 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 113.20 on 6 degrees of freedom
## Residual deviance: 3.72 on 5 degrees of freedom
## AIC: 42.332
##
## Number of Fisher Scoring iterations: 3
confint(fit) # profile likelihood CI
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.165294 -1.3609733
## ideology 0.396166 0.5959414
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Anova(fit)# likelihood-ratio tests for effect parameters in a GLM
## Analysis of Deviance Table (Type II tests)
##
## Response: true/n
## LR Chisq Df Pr(>Chisq)
## ideology 109.48 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(statmod)
## Warning: package 'statmod' was built under R version 4.1.3
fit0 <- glm(true/n ~ 1, family=binomial, weights=n, data=Evo) # null model
#glm.scoretest(fit0, Evo$ideology)
#glm.scoretest(fit0, Evo$ideology)ˆ2 # squaring a z score statistic
# 104.101 # score chi-squared statistic with df=1
TDEVIANCE The deviance is the likelihood ratio statistics for comparing model M to the saturted mode It represents the square root of the contribution that each data point has to the overall residual deviance are analogous in the ordinary least squares, when we square the deviance residual we get the residual deviance that we use to asses how well a model fits the data, we can use the t identify the outliers
For such cases, the deviance provides a goodness-of-fit test of the model, because it tests the hypothesis that all possible parameters not included in the model equal 0. the residual deviance for a model is the likelihood-ratio statistic for testing that all the β effect terms in the model equal 0. Deviance residuals are alternative measures of lack of fit that decompose the deviance
The residual df equals the number of observations minus the number ofmodel parameters.
we can compare the models by comparing their deviances. This test statistic is large when M0 fits poorly compared to M1.
attach(Evo)
cbind(ideology,true,false,n,true/n,fitted(fit),rstandard(fit,type="pearson"))
## ideology true false n
## 1 1 11 37 48 0.2291667 0.2205679 0.1611162
## 2 2 46 104 150 0.3066667 0.3168813 -0.3515386
## 3 3 70 72 142 0.4929577 0.4319445 1.6480176
## 4 4 241 214 455 0.5296703 0.5548525 -1.4995488
## 5 5 78 36 114 0.6842105 0.6713982 0.3248519
## 6 6 89 24 113 0.7876106 0.7700750 0.5413625
## 7 7 36 6 42 0.8571429 0.8459201 0.2206605
# rstandard(fit, type="pearson") requests standardized residuals
#ideology|| true || false|| n || # sample|| fitted | std. res.
3.5 FITTING GENERALIZED LINEAR MODELS 3.5.1 The Fisher Scoring Algorithm Fits GLMs