Generalized linear model,ch 3

This chapter introduces models for categorical and other discrete response variables. GLMs for categorical responses include logistic regression model

##If the outcome is a count (for discrete response variables), then use loginear model# ##If the outcome is a binary count (for discrete response variables), then use logistic regression model# or linear probability model, it mimics ordinary linear modeling and use identity link function P(Y=1)=

introduces inference,methods, model checking, and model fitting for GLMs

================= Random component=Y (identifies response variable Y and its probability distribution). Binomial,Poission or nonnegative binomial, etc Linear predictor= specifies the explanatory variables through a predictin equation that has a linear form,

Link function= secifies afunction E(Y) that the GLM relates to the linear predictor, log link functions is appropiate when mucan’t take negative values such as with count data g(mu)=log[mu/(1-mu)]

Binomial, linear probability model interpretation For example, the parameter β1 represents the change in P(Y = 1) per unit change in x1, adjusting for the other variables. This model is a GLM with a binomial random component and identity link function

Heart <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart.dat", header=TRUE)
Heart #in contingency table form

##              snoring yes   no
## 1              never  24 1355
## 2         occasional  35  603
## 3 nearly_every_night  21  192
## 4        every_night  30  224

library(dplyr) # to recode explanatory variable

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Heart$x <- recode(Heart$snoring, never = 0, occasional = 2, nearly_every_night = 4, every_night = 5)
n <- Heart$yes + Heart$no # binomial sample sizes are the row totals
fit <- glm(yes/n ~ x, family=binomial(link=logit), weights=n, data=Heart)
 # canonical link for binomial is logit, so "(link=logit)" not necessary
  # "weights" indicates sample proportion yes/n is based on n observations
summary(fit)

## 
## Call:
## glm(formula = yes/n ~ x, family = binomial(link = logit), data = Heart, 
##     weights = n)
## 
## Deviance Residuals: 
##       1        2        3        4  
## -0.8346   1.2521   0.2758  -0.6845  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.86625    0.16621 -23.261  < 2e-16 ***
## x            0.39734    0.05001   7.945 1.94e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.9045  on 3  degrees of freedom
## Residual deviance:  2.8089  on 2  degrees of freedom
## AIC: 27.061
## 
## Number of Fisher Scoring iterations: 4

fitted(fit)

##          1          2          3          4 
## 0.02050742 0.04429511 0.09305411 0.13243885

Identity link function

fit2 <- glm(yes/n ~ x, family=quasi(link=identity, variance="mu(1-mu)"),weights=n, data=Heart)
summary(fit2, dispersion=1)

## 
## Call:
## glm(formula = yes/n ~ x, family = quasi(link = identity, variance = "mu(1-mu)"), 
##     data = Heart, weights = n)
## 
## Deviance Residuals: 
##        1         2         3         4  
##  0.04478  -0.21322   0.11010   0.09798  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.017247   0.003451   4.998 5.80e-07 ***
## x           0.019778   0.002805   7.051 1.77e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasi family taken to be 1)
## 
##     Null deviance: 65.904481  on 3  degrees of freedom
## Residual deviance:  0.069191  on 2  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3

#which yields the resultˆ P(Y = 1) = 0.017 + 0.020x

Ungroupped binary data A standard data file has a row for each subject (e.g., person) and a column for each variable. For a binary response variable, the data file has a column of 1 (success) and 0 (failure) values.

Heart2 <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Heart2.dat", header=TRUE)
head(Heart2,10)

##    subject snoring x y
## 1        1   never 0 1
## 2        2   never 0 1
## 3        3   never 0 1
## 4        4   never 0 1
## 5        5   never 0 1
## 6        6   never 0 1
## 7        7   never 0 1
## 8        8   never 0 1
## 9        9   never 0 1
## 10      10   never 0 1

3.3.2 Poisson Loglinear Model

The response outcome for each female crab is her number of satellites. An explanatory variable thought possibly to affect this was the female crab’s carapace (shell) width, which is a summary of her size.

Crabs <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Crabs.dat", header=TRUE)
Crabs

##     crab sat y weight width color spine
## 1      1   8 1  3.050  28.3     2     3
## 2      2   0 0  1.550  22.5     3     3
## 3      3   9 1  2.300  26.0     1     1
## 4      4   0 0  2.100  24.8     3     3
## 5      5   4 1  2.600  26.0     3     3
## 6      6   0 0  2.100  23.8     2     3
## 7      7   0 0  2.350  26.5     1     1
## 8      8   0 0  1.900  24.7     3     2
## 9      9   0 0  1.950  23.7     2     1
## 10    10   0 0  2.150  25.6     3     3
## 11    11   0 0  2.150  24.3     3     3
## 12    12   0 0  2.650  25.8     2     3
## 13    13  11 1  3.050  28.2     2     3
## 14    14   0 0  1.850  21.0     4     2
## 15    15  14 1  2.300  26.0     2     1
## 16    16   8 1  2.950  27.1     1     1
## 17    17   1 1  2.000  25.2     2     3
## 18    18   1 1  3.000  29.0     2     3
## 19    19   0 0  2.200  24.7     4     3
## 20    20   5 1  2.700  27.4     2     3
## 21    21   4 1  1.950  23.2     2     2
## 22    22   3 1  2.300  25.0     1     2
## 23    23   1 1  1.600  22.5     2     1
## 24    24   2 1  2.600  26.7     3     3
## 25    25   3 1  2.000  25.8     4     3
## 26    26   0 0  1.300  26.2     4     3
## 27    27   3 1  3.150  28.7     2     3
## 28    28   5 1  2.700  26.8     2     1
## 29    29   0 0  2.600  27.5     4     3
## 30    30   0 0  2.100  24.9     2     3
## 31    31   4 1  3.200  29.3     1     1
## 32    32   0 0  2.600  25.8     1     3
## 33    33   0 0  2.000  25.7     2     2
## 34    34   8 1  2.000  25.7     2     1
## 35    35   5 1  2.700  26.7     2     1
## 36    36   0 0  1.850  23.7     4     3
## 37    37   0 0  2.650  26.8     2     3
## 38    38   6 1  3.150  27.5     2     3
## 39    39   0 0  1.900  23.4     4     3
## 40    40   6 1  2.800  27.9     2     3
## 41    41   3 1  3.100  27.5     3     3
## 42    42   5 1  2.800  26.1     1     1
## 43    43   6 1  2.500  27.7     1     1
## 44    44   5 1  3.300  30.0     2     1
## 45    45   9 1  3.250  28.5     3     1
## 46    46   4 1  2.800  28.9     3     3
## 47    47   6 1  2.600  28.2     2     3
## 48    48   4 1  2.100  25.0     2     3
## 49    49   3 1  3.000  28.5     2     3
## 50    50   3 1  3.600  30.3     2     1
## 51    51   5 1  2.100  24.7     4     3
## 52    52   5 1  2.900  27.7     2     3
## 53    53   6 1  2.700  27.4     1     1
## 54    54   4 1  1.600  22.9     2     3
## 55    55   5 1  2.000  25.7     2     1
## 56    56  15 1  3.000  28.3     2     3
## 57    57   3 1  2.700  27.2     2     3
## 58    58   3 1  2.300  26.2     3     3
## 59    59   0 0  2.750  27.8     2     1
## 60    60   0 0  2.250  25.5     4     3
## 61    61   0 0  2.550  27.1     3     3
## 62    62   5 1  2.050  24.5     3     3
## 63    63   3 1  2.450  27.0     3     1
## 64    64   5 1  2.150  26.0     2     3
## 65    65   1 1  2.800  28.0     2     3
## 66    66   8 1  3.050  30.0     2     3
## 67    67  10 1  3.200  29.0     2     3
## 68    68   0 0  2.400  26.2     2     3
## 69    69   0 0  1.300  26.5     2     1
## 70    70   3 1  2.400  26.2     2     3
## 71    71   7 1  2.800  25.6     3     3
## 72    72   1 1  1.650  23.0     3     3
## 73    73   0 0  1.800  23.0     3     3
## 74    74   6 1  2.250  25.4     2     3
## 75    75   0 0  1.900  24.2     3     3
## 76    76   0 0  1.600  22.9     2     2
## 77    77   3 1  2.200  26.0     3     2
## 78    78   4 1  2.250  25.4     2     3
## 79    79   0 0  1.200  25.7     3     3
## 80    80   5 1  2.100  25.1     2     3
## 81    81   0 0  2.250  24.5     3     2
## 82    82   0 0  2.900  27.5     4     3
## 83    83   0 0  1.650  23.1     3     3
## 84    84   4 1  2.550  25.9     3     1
## 85    85   0 0  2.300  25.8     2     3
## 86    86   3 1  2.250  27.0     4     3
## 87    87   0 0  3.050  28.5     2     3
## 88    88   0 0  2.750  25.5     4     1
## 89    89   0 0  1.900  23.5     4     3
## 90    90   0 0  1.700  24.0     2     2
## 91    91   5 1  3.850  29.7     2     1
## 92    92   0 0  2.550  26.8     2     1
## 93    93   0 0  2.450  26.7     4     3
## 94    94   0 0  3.200  28.7     2     1
## 95    95   0 0  1.550  23.1     3     3
## 96    96   1 1  2.800  29.0     2     1
## 97    97   0 0  2.250  25.5     3     3
## 98    98   1 1  1.967  26.5     3     3
## 99    99   1 1  2.200  24.5     3     3
## 100  100   1 1  3.000  28.5     3     3
## 101  101   1 1  2.867  28.2     2     3
## 102  102   1 1  1.600  24.5     2     3
## 103  103   1 1  2.550  27.5     2     3
## 104  104   4 1  2.550  24.7     2     2
## 105  105   1 1  2.000  25.2     2     1
## 106  106   1 1  2.900  27.3     3     3
## 107  107   1 1  2.400  26.3     2     3
## 108  108   1 1  3.100  29.0     2     3
## 109  109   2 1  1.900  25.3     2     3
## 110  110   4 1  2.300  26.5     2     3
## 111  111   3 1  3.250  27.8     2     3
## 112  112   6 1  2.500  27.0     2     3
## 113  113   0 0  2.100  25.7     3     3
## 114  114   2 1  2.100  25.0     2     3
## 115  115   2 1  3.325  31.9     2     3
## 116  116   0 0  1.800  23.7     4     3
## 117  117  12 1  3.225  29.3     4     3
## 118  118   0 0  1.400  22.0     3     3
## 119  119   5 1  2.400  25.0     2     3
## 120  120   6 1  2.500  27.0     3     3
## 121  121   6 1  1.800  23.8     3     3
## 122  122   2 1  3.275  30.2     1     1
## 123  123   0 0  2.225  26.2     3     3
## 124  124   2 1  1.650  24.2     2     3
## 125  125   3 1  2.900  27.4     2     3
## 126  126   0 0  2.300  25.4     2     2
## 127  127   3 1  3.200  28.4     3     3
## 128  128   4 1  1.475  22.5     4     3
## 129  129   2 1  2.025  26.2     2     3
## 130  130   6 1  2.300  24.9     2     1
## 131  131   6 1  1.950  24.5     1     2
## 132  132   0 0  1.800  25.1     2     3
## 133  133   4 1  2.900  28.0     2     1
## 134  134  10 1  2.250  25.8     4     3
## 135  135   7 1  3.050  27.9     2     3
## 136  136   0 0  2.200  24.9     2     3
## 137  137   5 1  3.100  28.4     2     1
## 138  138   5 1  2.400  27.2     3     3
## 139  139   6 1  2.250  25.0     2     2
## 140  140   6 1  2.625  27.5     2     3
## 141  141   7 1  5.200  33.5     2     1
## 142  142   3 1  3.325  30.5     2     3
## 143  143   3 1  2.925  29.0     3     3
## 144  144   0 0  2.000  24.3     2     1
## 145  145   0 0  2.400  25.8     2     3
## 146  146   8 1  2.100  25.0     4     3
## 147  147   4 1  3.725  31.7     2     1
## 148  148   4 1  3.025  29.5     2     3
## 149  149  10 1  1.900  24.0     3     3
## 150  150   9 1  3.000  30.0     2     3
## 151  151   4 1  2.850  27.6     2     3
## 152  152   0 0  2.300  26.2     2     3
## 153  153   0 0  2.000  23.1     2     1
## 154  154   0 0  1.600  22.9     2     1
## 155  155   0 0  1.900  24.5     4     3
## 156  156   4 1  1.950  24.7     2     3
## 157  157   0 0  3.200  28.3     2     3
## 158  158   2 1  1.850  23.9     2     3
## 159  159   0 0  1.800  23.8     3     3
## 160  160   4 1  3.500  29.8     3     2
## 161  161   4 1  2.350  26.5     2     3
## 162  162   3 1  2.275  26.0     2     3
## 163  163   8 1  3.050  28.2     2     3
## 164  164   0 0  2.150  25.7     4     3
## 165  165   7 1  2.750  26.5     2     3
## 166  166   0 0  2.200  25.8     2     3
## 167  167   0 0  1.800  24.1     3     3
## 168  168   2 1  2.175  26.2     3     3
## 169  169   3 1  2.750  26.1     3     3
## 170  170   4 1  3.275  29.0     3     3
## 171  171   0 0  2.625  28.0     1     1
## 172  172   0 0  2.625  27.0     4     3
## 173  173   0 0  2.000  24.5     2     2

plot(sat ~ width, xlab="Width", ylab="Number of satellites", data=Crabs)
fit <- glm(sat ~ width, family=poisson(link=log), data=Crabs) # canonical link for Poisson is log, so "(link=log)" is not necessary
summary(fit)

## 
## Call:
## glm(formula = sat ~ width, family = poisson(link = log), data = Crabs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

#Great variability GAM helps to discern trend
library(gam) # generalized additive model smoothing fit

## Warning: package 'gam' was built under R version 4.1.3

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.20.1

gam.fit <- gam(sat ~ s(width), family=poisson, data=Crabs)
summary(gam.fit) #foes not say much

## 
## Call: gam(formula = sat ~ s(width), family = poisson, data = Crabs)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9444 -1.9749 -0.5479  0.9783  4.7707 
## 
## (Dispersion Parameter for poisson family taken to be 1)
## 
##     Null Deviance: 632.7917 on 172 degrees of freedom
## Residual Deviance: 555.1206 on 168 degrees of freedom
## AIC: 920.4184 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## s(width)    1  55.55  55.545  17.473 4.679e-05 ***
## Residuals 168 534.06   3.179                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar Chisq  P(Chi)   
## (Intercept)                              
## s(width)          3     11.719 0.00841 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# s() is smooth function predictor for generalized additive model
curve(predict(gam.fit, data.frame(width=x), type="resp"), add=TRUE)

#exp(0.164) = 1.178 represents the multiplicative effect on the fitted value for each 1-cm increase in x.
#For instance, the fitted value at x = 27.3 = 26.3 + 1 is exp[−3.305 + 0.164(27.3)] = 3.23, which equals 1.178(2.74).

STATISTICAL INFERENCE AND MODEL CHECKING Wald, Likelihood-Ratio and Score Inference Use the Likelihood Function

Evo <- read.table("http://www.stat.ufl.edu/~aa/cat/data/Evolution.dat", header=TRUE)
Evo

##   ideology true false
## 1        1   11    37
## 2        2   46   104
## 3        3   70    72
## 4        4  241   214
## 5        5   78    36
## 6        6   89    24
## 7        7   36     6

n <- Evo$true + Evo$false # binomial sample sizes
fit <- glm(true/n ~ ideology, family=binomial, weights=n, data=Evo)
summary(fit) # logistic regression fit

## 
## Call:
## glm(formula = true/n ~ ideology, family = binomial, data = Evo, 
##     weights = n)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7  
##  0.1430  -0.2697   1.4614  -1.0791   0.2922   0.4471   0.2035  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.75658    0.20500  -8.569   <2e-16 ***
## ideology     0.49422    0.05092   9.706   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 113.20  on 6  degrees of freedom
## Residual deviance:   3.72  on 5  degrees of freedom
## AIC: 42.332
## 
## Number of Fisher Scoring iterations: 3

confint(fit) # profile likelihood CI

## Waiting for profiling to be done...

##                 2.5 %     97.5 %
## (Intercept) -2.165294 -1.3609733
## ideology     0.396166  0.5959414

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

Anova(fit)# likelihood-ratio tests for effect parameters in a GLM

## Analysis of Deviance Table (Type II tests)
## 
## Response: true/n
##          LR Chisq Df Pr(>Chisq)    
## ideology   109.48  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(statmod)

## Warning: package 'statmod' was built under R version 4.1.3

fit0 <- glm(true/n ~ 1, family=binomial, weights=n, data=Evo) # null model
#glm.scoretest(fit0, Evo$ideology)
#glm.scoretest(fit0, Evo$ideology)ˆ2 # squaring a z score statistic
# 104.101 # score chi-squared statistic with df=1

TDEVIANCE The deviance is the likelihood ratio statistics for comparing model M to the saturted mode It represents the square root of the contribution that each data point has to the overall residual deviance are analogous in the ordinary least squares, when we square the deviance residual we get the residual deviance that we use to asses how well a model fits the data, we can use the t identify the outliers

For such cases, the deviance provides a goodness-of-fit test of the model, because it tests the hypothesis that all possible parameters not included in the model equal 0. the residual deviance for a model is the likelihood-ratio statistic for testing that all the β effect terms in the model equal 0. Deviance residuals are alternative measures of lack of fit that decompose the deviance

The residual df equals the number of observations minus the number ofmodel parameters.

we can compare the models by comparing their deviances. This test statistic is large when M0 fits poorly compared to M1.

attach(Evo)
cbind(ideology,true,false,n,true/n,fitted(fit),rstandard(fit,type="pearson"))

##   ideology true false   n                               
## 1        1   11    37  48 0.2291667 0.2205679  0.1611162
## 2        2   46   104 150 0.3066667 0.3168813 -0.3515386
## 3        3   70    72 142 0.4929577 0.4319445  1.6480176
## 4        4  241   214 455 0.5296703 0.5548525 -1.4995488
## 5        5   78    36 114 0.6842105 0.6713982  0.3248519
## 6        6   89    24 113 0.7876106 0.7700750  0.5413625
## 7        7   36     6  42 0.8571429 0.8459201  0.2206605

# rstandard(fit, type="pearson") requests standardized residuals
#ideology|| true || false|| n || # sample|| fitted | std. res.

3.5 FITTING GENERALIZED LINEAR MODELS 3.5.1 The Fisher Scoring Algorithm Fits GLMs

Generalized linear model,ch 3

Ordinary regression and ANOVA

AleidaRamirez

5/12/2022