A GLM (generalized linear model) is different from simple or multivariate linear regression; in fact, the ordinary linear model is a special case of the GLM family.
One of the main properties of simple or multivariate linear models is the linear, additive response to the predictors: the response (dependent, or explained, variable) is estimated from a model that combines the covariates, or explanatory (independent) variables, additively. However, there are events in real life where such a linear combination is not appropriate. For example, events that take only a fixed set of outcomes (the result of a match between two teams) or counts of events distributed over time (arrivals, queues) do not respond well to a linear model.
A GLM takes care of some instances that simple models struggle with. This comes at the expense of some of the tidiness and mathematical assumptions on which simple multivariate linear models are based. As useful as linear models are, they struggle to address the complexity of certain types of events, e.g. binary outcomes: it is difficult to apply an additive Gaussian response model to an outcome that only takes the values 0 and 1.
A GLM has three components:
- A random component: the response follows a distribution from the exponential family (e.g. Bernoulli/binomial, Poisson).
- A systematic component: the linear predictor, an additive combination of the covariates and their coefficients.
- A link function that connects the expected value of the response to the linear predictor.
GLM: THE LOGISTIC REGRESSION
One of the most widely used GLMs is logistic regression. It uses the logit link to model the likelihood of an outcome (a Bernoulli-type event) based on a linear combination of predictors.
\[ Y_i \sim \mathrm{Bernoulli}(\mu_i) \;\Rightarrow\; E[Y_i] = \mu_i, \quad 0 \le \mu_i \le 1 \]
That is, each response \( Y_i \) follows a Bernoulli distribution for one of two possible outcomes, with expected value \( \mu_i \), where \( \mu_i \) can take any value between 0 and 1.
Let \( \eta_i \) denote the linear predictor
\[ \eta_i = \sum_k X_{ik} \beta_k \]
It is a linear, additive function that results from multiplying the covariates by the estimated coefficients (a combination of coefficients times the values of the covariates).
The link function in this case is \( \log(\mu/(1-\mu)) \), the logit.
The probability of the mean is tied to the linear predictor through the odds, represented by the logit function \( g(\mu) = \eta = \log(\mu/(1-\mu)) \). The odds are the ratio \( p/q = \mu/(1-\mu) \). Then, to connect the mean to the linear predictor, the log of the odds of the event occurring (the logit) is used.
According to the Bernoulli distribution, the probability function of \( Y_i \) is
\[ P(Y_i = y_i) = \mu_i^{\,y_i} (1-\mu_i)^{1-y_i} \]
Setting the logit equal to the linear predictor gives \( \log\!\big(\mu_i/(1-\mu_i)\big) = \sum_k X_{ik}\beta_k \); inverting the link on both sides of the equation, the probability itself can be expressed as a function of the linear predictor:
\[ \mu_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} \]
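R already implements this link pair: qlogis() is the logit and plogis() is its inverse. A minimal sketch (not part of the Ravens analysis below) confirming they undo each other:

p <- 0.7
eta <- qlogis(p)   # logit: log(p/(1 - p))
eta
## [1] 0.8472979
plogis(eta)        # inverse logit: exp(eta)/(1 + exp(eta)) recovers p
## [1] 0.7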
load("C:/Users/Jen/Downloads/ravensData.rda")
head(ravensData, 20)
## ravenWinNum ravenWin ravenScore opponentScore
## 1 1 W 24 9
## 2 1 W 38 35
## 3 1 W 28 13
## 4 1 W 34 31
## 5 1 W 44 13
## 6 0 L 23 24
## 7 1 W 31 30
## 8 1 W 23 16
## 9 1 W 9 6
## 10 1 W 31 29
## 11 0 L 13 43
## 12 1 W 25 15
## 13 1 W 55 20
## 14 1 W 13 10
## 15 1 W 16 13
## 16 0 L 20 23
## 17 0 L 28 31
## 18 0 L 17 34
## 19 1 W 33 14
## 20 0 L 17 23
summary(ravensData)
## ravenWinNum ravenWin ravenScore opponentScore
## Min. :0.0 L: 6 Min. : 9.0 Min. : 6.00
## 1st Qu.:0.0 W:14 1st Qu.:17.0 1st Qu.:13.00
## Median :1.0 Median :24.5 Median :21.50
## Mean :0.7 Mean :26.1 Mean :21.60
## 3rd Qu.:1.0 3rd Qu.:31.5 3rd Qu.:30.25
## Max. :1.0 Max. :55.0 Max. :43.00
What do the distributions of the dependent binomial (response) variable and the ‘independent’ (predictor) variable look like?
plot(ravensData$ravenScore)
plot(ravensData$ravenWinNum)
plot(ravensData$ravenScore, ravensData$ravenWinNum)
Now the Logistic Regression in R:
LogRavensData <- glm(ravensData$ravenWinNum~ravensData$ravenScore, family = "binomial")
summary(LogRavensData)
##
## Call:
## glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore,
## family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7575 -1.0999 0.5305 0.8060 1.4947
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001 1.55412 -1.081 0.28
## ravensData$ravenScore 0.10658 0.06674 1.597 0.11
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.435 on 19 degrees of freedom
## Residual deviance: 20.895 on 18 degrees of freedom
## AIC: 24.895
##
## Number of Fisher Scoring iterations: 5
Fitting the Score Data to the Model:
plot(ravensData$ravenScore, LogRavensData$fitted, pch=19, col="blue", xlab="Score", ylab="Prob of Winning")
Odds Ratios and Coefficient Interpretation:
exp(LogRavensData$coeff)
## (Intercept) ravensData$ravenScore
## 0.1863724 1.1124694
The exponentiated slope, exp(B) = 1.11, indicates an 11% increase in the odds (not the probability) of the team winning for each additional point that the team scores.
The exponentiated intercept, exp(A) = 0.186, is the odds of the team winning when the score is zero; as a probability that is 0.186/(1 + 0.186), about 0.16. Since a score of zero lies outside the observed data, this is an extrapolation with little practical meaning.
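To make that concrete, plogis() converts the fitted log-odds back to probabilities directly from the coefficients above (values approximate):

coefs <- coef(LogRavensData)
plogis(coefs[1])                   # P(win) at a score of 0: ~0.157
plogis(coefs[1] + coefs[2] * 30)   # P(win) at a score of 30: ~0.82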
Confidence Intervals
exp(confint(LogRavensData))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.005674966 3.106384
## ravensData$ravenScore 0.996229662 1.303304
The 95% confidence interval (two-tailed) for the exponentiated slope includes the value 1 (equivalently, the interval for the B coefficient includes 0), which indicates the B coefficient is not significant at the 5% level.
anova(LogRavensData,test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: ravensData$ravenWinNum
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 19 24.435
## ravensData$ravenScore 1 3.5398 18 20.895 0.05991 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
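The Pr(>Chi) value above is just the upper tail of the drop in deviance between the null and fitted models (24.435 − 20.895 ≈ 3.54) on 1 degree of freedom, which can be verified by hand:

pchisq(24.435 - 20.895, df = 1, lower.tail = FALSE)  # ~0.0599, matching Pr(>Chi) above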
GLM: The Poisson Distribution
\[ P(k \text{ events in an interval}) = \frac{\lambda^{k} e^{-\lambda}}{k!} \]
Mean of the Poisson: \( E(k) = \lambda \). Variance: \( \mathrm{Var}(k) = \lambda \).
\( \lambda \) is the rate of events over a period of time. When \( \lambda \) is large, the Poisson distribution tends toward the normal (see the charts below for a demonstration).
par(mfrow=c(1,3))
plot(0:10, dpois(0:10,lambda = 2), type="h", frame=FALSE)
plot(0:20, dpois(0:20,lambda = 10), type="h", frame=FALSE)
plot(0:200, dpois(0:200,lambda = 100), type="h", frame=FALSE)
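To tie the Poisson distribution back into the GLM framework, counts can be modeled with glm() and the (default) log link. A minimal sketch on simulated data; the variables x and y here are hypothetical, not from the Ravens data set:

set.seed(42)
x <- runif(100, 0, 10)                        # hypothetical covariate
y <- rpois(100, lambda = exp(0.5 + 0.2 * x))  # true log-mean: 0.5 + 0.2*x
PoisFit <- glm(y ~ x, family = "poisson")     # log link is the default
exp(coef(PoisFit))  # multiplicative change in the expected count per unit of x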