A GLM (generalized linear model) is different from simple or multivariate linear regression; in fact, the ordinary linear model is a special case of the GLM family.
One of the main properties of simple or multivariate linear models is the linear, additive response to the predictors: the response (dependent, or explained, variable) is estimated from a model that combines the covariates, or explanatory (independent) variables, additively. However, there are events in real life where such a linear combination is not appropriate. For example, events that take only a fixed set of outcomes (the result of a match between two teams) or counts of events distributed over time (arrivals, queues) do not respond well to a linear model.
A GLM takes care of some instances that simple models struggle with. This comes at the expense of some of the tidiness and mathematical assumptions on which simple multivariate linear models are based. As useful as linear models are, they struggle to address the complexity of certain types of events, e.g. binary outcomes: it is difficult to apply an additive Gaussian response model to an outcome that only takes the values 0 and 1.
A GLM has three components:
- A random component: the response follows a distribution from the exponential family (e.g. Bernoulli/binomial, Poisson).
- A systematic component: the linear predictor, an additive combination of the covariates and their coefficients.
- A link function that connects the expected value of the response to the linear predictor.
GLM: THE LOGISTIC REGRESSION
One of the most widely used GLMs is logistic regression. It uses the logit link to model the likelihood of an outcome (a Bernoulli-type event) based on a linear combination of predictors.
\[ Y_i \sim \mathrm{Bernoulli}(\mu_i) \;\Rightarrow\; E[Y_i] = \mu_i, \quad 0 \le \mu_i \le 1 \]
That is, each response \( Y_i \) follows a Bernoulli distribution for one of two possible outcomes, with expected value \( \mu_i \), where \( \mu_i \) can take any value between 0 and 1.
Let \( \eta_i \) denote the linear predictor
\[ \eta_i = \sum_k X_{ik} \beta_k \]
It is a linear, additive function that results from multiplying the covariates by the estimated coefficients (a combination of coefficients times the values of the covariates).
The link function in this case is \( \log(\mu/(1-\mu)) \), the logit.
The probability of the mean is tied to the linear predictor through the odds, represented by the logit function \( g(\mu) = \eta = \log(\mu/(1-\mu)) \). The odds are the ratio \( p/q = \mu/(1-\mu) \). Then, to connect the mean to the linear predictor, the log of the odds of the event occurring (the logit) is used.
According to the Bernoulli distribution, the probability function of \( Y_i \) is
\[ P(Y_i = y_i) = \mu_i^{\,y_i} (1-\mu_i)^{1-y_i} \]
Setting the logit equal to the linear predictor gives \( \log\!\big(\mu_i/(1-\mu_i)\big) = \sum_k X_{ik}\beta_k \); inverting the link on both sides of the equation, the probability itself can be expressed as a function of the linear predictor:
\[ \mu_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} \]
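R already implements this link pair: qlogis() is the logit and plogis() is its inverse. A minimal sketch (not part of the Ravens analysis below) confirming they undo each other:

p <- 0.7
eta <- qlogis(p)   # logit: log(p/(1 - p))
eta
## [1] 0.8472979
plogis(eta)        # inverse logit: exp(eta)/(1 + exp(eta)) recovers p
## [1] 0.7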
load("C:/Users/Jen/Downloads/ravensData.rda")
head(ravensData, 20)
## ravenWinNum ravenWin ravenScore opponentScore
## 1 1 W 24 9
## 2 1 W 38 35
## 3 1 W 28 13
## 4 1 W 34 31
## 5 1 W 44 13
## 6 0 L 23 24
## 7 1 W 31 30
## 8 1 W 23 16
## 9 1 W 9 6
## 10 1 W 31 29
## 11 0 L 13 43
## 12 1 W 25 15
## 13 1 W 55 20
## 14 1 W 13 10
## 15 1 W 16 13
## 16 0 L 20 23
## 17 0 L 28 31
## 18 0 L 17 34
## 19 1 W 33 14
## 20 0 L 17 23
summary(ravensData)
## ravenWinNum ravenWin ravenScore opponentScore
## Min. :0.0 L: 6 Min. : 9.0 Min. : 6.00
## 1st Qu.:0.0 W:14 1st Qu.:17.0 1st Qu.:13.00
## Median :1.0 Median :24.5 Median :21.50
## Mean :0.7 Mean :26.1 Mean :21.60
## 3rd Qu.:1.0 3rd Qu.:31.5 3rd Qu.:30.25
## Max. :1.0 Max. :55.0 Max. :43.00
What do the distributions of the dependent binomial (response) variable and the ‘independent’ (predictor) variable look like?
plot(ravensData$ravenScore)
plot(ravensData$ravenWinNum)
plot(ravensData$ravenScore, ravensData$ravenWinNum)
Now the Logistic Regression in R:
LogRavensData <- glm(ravensData$ravenWinNum~ravensData$ravenScore, family = "binomial")
summary(LogRavensData)
##
## Call:
## glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore,
## family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -1.0999 0.5305 0.8060 1.4947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001 1.55412 -1.081 0.28
## ravensData$ravenScore 0.10658 0.06674 1.597 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.895
##
## Number of Fisher Scoring iterations: 5
Fitting the Score Data to the Model:
plot(ravensData$ravenScore, LogRavensData$fitted, pch=19, col="blue", xlab="Score", ylab="Prob of Winning")
Odds Ratios and Coefficient Interpretation:
exp(LogRavensData$coeff)
## (Intercept) ravensData$ravenScore
## 0.1863724 1.1124694
The exponentiated slope, exp(B) = 1.11, indicates an 11% increase in the odds (not the probability) of the team winning for each additional point that the team scores.
The exponentiated intercept, exp(A) = 0.186, is the odds of the team winning when the score is zero; as a probability that is 0.186/(1 + 0.186), about 0.16. Since a score of zero lies outside the observed data, this is an extrapolation with little practical meaning.
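To make that concrete, plogis() converts the fitted log-odds back to probabilities directly from the coefficients above (values approximate):

coefs <- coef(LogRavensData)
plogis(coefs[1])                   # P(win) at a score of 0: ~0.157
plogis(coefs[1] + coefs[2] * 30)   # P(win) at a score of 30: ~0.82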
Confidence Intervals
exp(confint(LogRavensData))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.005674966 3.106384
## ravensData$ravenScore 0.996229662 1.303304
The 95% confidence interval (two-tailed) for the exponentiated slope includes the value 1 (equivalently, the interval for the B coefficient includes 0), which indicates the B coefficient is not significant at the 5% level.
anova(LogRavensData,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: ravensData$ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## ravensData$ravenScore 1 3.5398 18 20.895 0.05991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
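The Pr(>Chi) value above is just the upper tail of the drop in deviance between the null and fitted models (24.435 − 20.895 ≈ 3.54) on 1 degree of freedom, which can be verified by hand:

pchisq(24.435 - 20.895, df = 1, lower.tail = FALSE)  # ~0.0599, matching Pr(>Chi) above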
GLM: The Poisson Distribution
\[ P(k \text{ events in an interval}) = \frac{\lambda^{k} e^{-\lambda}}{k!} \]
Mean of the Poisson: \( E(k) = \lambda \). Variance: \( \mathrm{Var}(k) = \lambda \).
\( \lambda \) is the rate of events over a period of time. When \( \lambda \) is large, the Poisson distribution tends toward the normal (see the charts below for a demonstration).
par(mfrow=c(1,3))
plot(0:10, dpois(0:10,lambda = 2), type="h", frame=FALSE)
plot(0:20, dpois(0:20,lambda = 10), type="h", frame=FALSE)
plot(0:200, dpois(0:200,lambda = 100), type="h", frame=FALSE)
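To tie the Poisson distribution back into the GLM framework, counts can be modeled with glm() and the (default) log link. A minimal sketch on simulated data; the variables x and y here are hypothetical, not from the Ravens data set:

set.seed(42)
x <- runif(100, 0, 10)                        # hypothetical covariate
y <- rpois(100, lambda = exp(0.5 + 0.2 * x))  # true log-mean: 0.5 + 0.2*x
PoisFit <- glm(y ~ x, family = "poisson")     # log link is the default
exp(coef(PoisFit))  # multiplicative change in the expected count per unit of x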