Generalized Linear Models

setwd("~/Desktop/University of Utah PhD /Course Work/Fall 2022 Semester/GEOG6000_Data Analysis/lab05")

Binomial Model 1

Used for binary (0, 1) data.

irished = read.csv("../datafiles/irished.csv", header = TRUE)

##Change the numerical values to categorical data with labels

irished$sex = factor(irished$sex,
                     levels = c(1,2),
                     labels = c("Male", "Female"))
irished$lvcert = factor(irished$lvcert,
                        levels = c(0,1),
                        labels = c("Not Taken", "Taken"))

##Center the DVRT Score

mean(irished$DVRT) #For reference

## [1] 100.152

irished$DVRT.cen = irished$DVRT - mean(irished$DVRT)

##Boxplot to examine the centered DVRT score and the leaving certificate

boxplot(DVRT.cen ~ lvcert, data = irished)

Building a GLM for irished dataset

##For a binomial distribution (0,1 data), the link function is a logit function which uses the natural log (see slides). This acts to make the relationship linear

irished.glm1 = glm(lvcert ~ DVRT.cen, data = irished,
                   family = binomial(link = "logit"))

summary(irished.glm1)

## 
## Call:
## glm(formula = lvcert ~ DVRT.cen, family = binomial(link = "logit"), 
##     data = irished)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9077  -0.9810  -0.4543   1.0307   2.1552  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.278422   0.099665  -2.794  0.00521 ** 
## DVRT.cen     0.064369   0.007576   8.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 686.86  on 499  degrees of freedom
## Residual deviance: 593.77  on 498  degrees of freedom
## AIC: 597.77
## 
## Number of Fisher Scoring iterations: 3

##Converting the coefficients from log-odds to odds

exp(coef(irished.glm1))

## (Intercept)    DVRT.cen 
##   0.7569771   1.0664856

Coefficient Interpretation: Based on these coefficients, for a student with an average DVRT score, they have approximately a 0.76 odds of obtaining a leaving certificate. For an increase of one unit in DVRT score, the student’s odds will increase by 1.07. This change is multiplicative meaning if the student’s DVRT score changes by two units, the chances will increase by, \(Odds = (0.76*1.07)*1.07\), and so on.
NOTE: This output is in **odds* not probability.
Odds to Probability is \(P = Odds/(1+Odds)\). In this case, the probability is 0.43.

0.76/(1+0.76)

## [1] 0.4318182

Prediction Using this Model

##We need new data to run the model. The independent variable, DVRT, needs to be converted to a centered score in order for the model to use it as we used a centered DVRT score to construct the model.

newDVRT = data.frame(DVRT.cen = 120 - mean(irished$DVRT)) ##DVRT.cen becomes the column header

predict(irished.glm1,
        newdata = newDVRT,
        type = 'response',
        se.fit = TRUE)

## $fit
##        1 
## 0.730895 
## 
## $se.fit
##          1 
## 0.03350827 
## 
## $residual.scale
## [1] 1

The probability is 0.73 that the student will obtain the certificate. The coefficients from the model are in odds. NOTE: This output gives you the probability due to the type = ‘response’ syntax. These are not equivalent.
Odds are the ratios of probabilities between an event happening versus the probability that it won’t. \(Odds in favor of A = A/(1-A)\)

newDVRT.2 = data.frame(DVRT.cen = seq(60,160) - mean(irished$DVRT))

lvcert.pred = predict(irished.glm1,
                      newdata = newDVRT.2,
                      type = 'response')

##Plotting the DVRT score sequence 60 - 160 with the probability of obtaining the leaving certificate. It should appear like this graph with asymptotes as they approach 0% and 100% probability

plot(newDVRT.2$DVRT.cen + mean(irished$DVRT),
     lvcert.pred,
     type = 'l',
     col = 2,
     lwd = 2,
     xlab = 'DVRT',
     ylab = 'Pr(lvcert)')

anova(irished.glm1, test = 'Chisq')

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: lvcert
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                       499     686.86              
## DVRT.cen  1   93.095       498     593.77 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, given the large F-statistic and small p-value we would reject the null hypothesis that the null (intercept model) is better. We accept the alternative hypothesis that the our current model is better.

Binomial Models 2

turbines = read.csv("../datafiles/turbines.csv")
head(turbines)

##   Hours Turbines Fissures
## 1   400       39        0
## 2  1000       53        4
## 3  1400       33        2
## 4  1800       73        7
## 5  2200       30        5
## 6  2600       39        9

##Calculating the proportion of fissures to turbines
##The plot includes additional observations? Or is this a prediction?
turbines$prop = turbines$Fissures/turbines$Turbines 
plot(prop ~ Hours, data = turbines)

turbine.glm = glm(prop ~ Hours,
                  data = turbines,
                  family = binomial(link = 'logit'),
                  weights = Turbines)

turbine.odds = exp(coef(turbine.glm))
turbine.odds

## (Intercept)       Hours 
##  0.01976986  1.00099974

Prediction

newturbine = data.frame(Hours = 5000)
predicted_probability = predict(turbine.glm,
        newdata = newturbine,
        type = 'response',
        se.fit = TRUE)
predicted_probability

## $fit
##         1 
## 0.7450891 
## 
## $se.fit
##          1 
## 0.04753524 
## 
## $residual.scale
## [1] 1

Poisson Models

Used for count data.

hsa = read.csv("../datafiles/hsa.csv", header = TRUE)
hsa$prog = factor(hsa$prog,
                  levels = c(1, 2, 3),
                  labels = c("General", "Academic", "Vocational"))

boxplot(math ~ num_awards, data = hsa)

boxplot(math ~ prog, data = hsa)

Within this dataset there seems to be a relationship between the average math score and the number of awards. Additionally, the average math score varies by program.

Building a Poisson Model

hsa.glm = glm(num_awards ~ math + prog,
              family = poisson(link = 'log'),
              data = hsa)

summary(hsa.glm)

## 
## Call:
## glm(formula = num_awards ~ math + prog, family = poisson(link = "log"), 
##     data = hsa)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
## math            0.07015    0.01060   6.619 3.63e-11 ***
## progAcademic    1.08386    0.35825   3.025  0.00248 ** 
## progVocational  0.36981    0.44107   0.838  0.40179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

hsa.glm.transformed.coef = exp(coef(hsa.glm))
hsa.glm.transformed.coef

##    (Intercept)           math   progAcademic progVocational 
##     0.00526263     1.07267164     2.95606545     1.44745846

Coefficient Interpretation

The intercept is the expected number of awards for a student with a zero math score in the general program. This is not particularly useful, however, we could re-run this analysis with a centered math score to produce a more meaningful intercept term.
The math coefficient, 1.07, is the award rate increase for a unit increase in math score. This is a positive relationship between math score and the number of awards.
The academic program coefficient, 2.96, is the award rate increase for students in this group.
The vocational program coefficient, 1.45, is the award rate increase for students in this group.
Note: These coefficients are multiplicative.

Prediction Model

##Create new data with a column for math and program - the same variables in the model. 
hsa.new = data.frame(math = 70, prog = 'Academic')

predict(hsa.glm,
        newdata = hsa.new,
        type = 'response',
        se.fit = TRUE)

## $fit
##        1 
## 2.111508 
## 
## $se.fit
##        1 
## 0.275062 
## 
## $residual.scale
## [1] 1

The expected number of awards is 2 for this prediction.

##Re-run the prediction for a student from the general program.
hsa.new2 = data.frame(math = 70, prog = 'General')

predict(hsa.glm,
        newdata = hsa.new2,
        type = 'response',
        se.fit = TRUE)

## $fit
##         1 
## 0.7142969 
## 
## $se.fit
##         1 
## 0.2686186 
## 
## $residual.scale
## [1] 1

The expected number of awards is approximately 1 (0.71).

EXERCISES

1. Bird Species in the Mediterranean

1.1 Boxplots

island = read.csv("../datafiles/island2.csv", header = TRUE)


##Note from Simon -
##Don’t confuse the reference for a covariate dummy variable with the 0/1 outcome of a binary variable. The coefficients in the model are the probably of getting a ‘success’, or a 1 outcome. So your intercept is the probability of 1 or presence of the bird
##Label the incidence as a factor and label it. The assumption is 0 means absent and 1 is present. 

island$incidence = factor(island$incidence,
                          levels = c(0,1),
                          labels = c("Absent", "Present"))

##Centered Data
island$area.cen = island$area - (mean(island$area))
island$isolation.cen = island$isolation - (mean(island$isolation))

par(mfrow = c(1, 3)) ##Creating a multiple plot area with 1 row and 3 columns

boxplot(area.cen ~ incidence, data = island,
        ylab = "Area",
        xlab = "Incidence",
        main = "Incidence by Centered Area",
         col = c("darksalmon", "darkseagreen"))

boxplot(isolation.cen ~ incidence, data = island,
        ylab = "Isolation",
        xlab = "Incidence",
        main = "Incidence by Centered Isolation",
        col = c("darksalmon", "darkseagreen"))

boxplot(quality ~ incidence, data = island,
        ylab = "Quality",
        xlab = "Incidence",
        main = "Incidence by Quality",
         col = c("darksalmon", "darkseagreen"))

Based on these initial boxplots, the presence of birds (incidence) seems to correlate with the island’s area and isolation. The relationship between area and presence is positive while the relationship between isolation and presence is negative.

1.2 Generalized Linear Model Build

##Initial uncentered model

bird.pres.uncent.glm = glm(incidence ~ area + isolation,
                    data = island,
                    family = binomial(link = "logit"))

summary(bird.pres.uncent.glm)

## 
## Call:
## glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"), 
##     data = island)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   6.6417     2.9218   2.273  0.02302 * 
## area          0.5807     0.2478   2.344  0.01909 * 
## isolation    -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6

##Calculating the log odds to odds for the uncentered model
incidence.uncen.odds = exp(coef(bird.pres.uncent.glm))
incidence.uncen.odds

## (Intercept)        area   isolation 
## 766.3669575   1.7873322   0.2536142

##Calculating the uncentered model probability
Intercept.uncent.prob = 766.3669575 /(1+766.3669575 )
Intercept.uncent.prob

## [1] 0.9986968

##Centered Model
bird.pres.glm = glm(incidence ~ area.cen + isolation.cen,
                    data = island,
                    family = binomial(link = "logit"))

summary(bird.pres.glm)

## 
## Call:
## glm(formula = incidence ~ area.cen + isolation.cen, family = binomial(link = "logit"), 
##     data = island)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8189  -0.3089   0.0490   0.3635   2.1192  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     1.1154     0.5877   1.898  0.05770 . 
## area.cen        0.5807     0.2478   2.344  0.01909 * 
## isolation.cen  -1.3719     0.4769  -2.877  0.00401 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.029  on 49  degrees of freedom
## Residual deviance: 28.402  on 47  degrees of freedom
## AIC: 34.402
## 
## Number of Fisher Scoring iterations: 6

##Convert the coefficients from log odds to odds

incidence.odds = exp(coef(bird.pres.glm))
incidence.odds

##   (Intercept)      area.cen isolation.cen 
##     3.0507967     1.7873322     0.2536142

##To convert from odds to probability the equation is P = Odds/(1+Odds)
Intercept.prob = 3.0507967/(1+3.0507967)
Intercept.prob

## [1] 0.753135

1.3 Coefficient and AIC Interpretation

The AIC is 34.4 which can be used to compare this model against other models to determine which model is best for describing this data set.
The intercept for the uncentered model is 766.37 which, is a probability of almost 1 (0.998). This means for that for an island with an area of zero and isolation of zero, there is almost statistical certainty that there will be bids present. This is not necessarily a useful interpretation for the intercept. After re-constructing the model with centered data, the intercept is 3.05 which means there is a probability of 0.75 that there is a presence of birds on an island with an average area and isolation.
The area coefficient is 1.79 which is a multiplicative coefficient that as the area increases by one unit, the odds of presence increase by a rate of 1.79.
The isolation coefficient is 0.25 which is also a multiplicative coefficient that as isolation increases by one unit, the odds of presence decrease by a rate of 0.25. Since this value is less than 1, it indicates that it negatively affects the presence of birds.

1.4 Prediction

##Create a new data frame
##Use the uncentered model as the new data is given in the original scale.

bird.data.new = data.frame(area = 5, isolation = 6)
bird.predict = predict.glm(bird.pres.uncent.glm,
                       newdata = bird.data.new,
                       type = 'response',
                       se.fit = TRUE)

bird.predict

## $fit
##         1 
## 0.7881208 
## 
## $se.fit
##         1 
## 0.1125028 
## 
## $residual.scale
## [1] 1

This output tells us that the probability of bird presence on this new island is approximately 0.79 with a standard error of ± 0.11.

2. Hemlock Trees in the Smoky Mountains

2.1 Model Build

tsuga = read.csv("../datafiles/tsuga.csv", header = TRUE)

tsuga.glm = glm(cover ~ streamdist + elev,
                data = tsuga,
                family = poisson(link = 'log'))

2.2 Coefficient Interpretation

summary(tsuga.glm)

## 
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"), 
##     data = tsuga)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.31395  -0.82155  -0.07929   0.71900   2.62316  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.622e+00  5.226e-02  31.047  < 2e-16 ***
## streamdist  -8.963e-04  1.173e-04  -7.641 2.15e-14 ***
## elev         8.901e-05  5.653e-05   1.575    0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 748.23  on 744  degrees of freedom
## Residual deviance: 687.10  on 742  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 3150.2
## 
## Number of Fisher Scoring iterations: 4

The intercept, 1.622, are the log-odds for the abundance of Hemlock trees given a zero value for both elevation and distance to stream. The p-value (2e-16) is less than the critical threshold of 0.05 meaning it is statistically significant.
The distance to stream coefficient, -8.963e-04, are the log-odds for the abundance of Hemlock trees if the elevation is held constant. The p-value (2.15e-14) is well below the critical threshold of 0.05 meaning it is statistically significant as an explanatory variable.
The elevation coefficient, 8.901e-05, are the log-odds for the abundance of Hemlock trees with other variables being held constant. The p-value is 0.115 which is above our critical threshold of 0.05 which means it is not statistically significant as an explanatory variable. -The model AIC is 3150 which can be used to compare models.

2.3 Coefficient Transformation

tsuga.glm.coef = exp(coef(tsuga.glm))
tsuga.glm.coef

## (Intercept)  streamdist        elev 
##   5.0652901   0.9991041   1.0000890

2.4 Model Interpretation

This model attempts to explain the abundance of Hemlock trees as a function of distance to a stream and elevation. With a stream distance of zero (0) and elevation of zero (0), the abundance of trees would be classified as approximately five (5). The abundance changes negatively for an increase in distance to the stream. As poisson generalized linear model rates are multiplicative and the stream distance coefficient is less than 1 - in this case 0.999, this means the relationship is negative. The elevation coefficient is 1, and statistically insignificant meaning that elevation does not play an important role in the abundance of Hemlock trees.

Leydet - GEOG6000 Lab 5 - Modeling III

David Leydet

2022-09-21

Generalized Linear Models

Binomial Model 1

Building a GLM for irished dataset

Prediction Using this Model

Binomial Models 2

Poisson Models

Building a Poisson Model

EXERCISES

1. Bird Species in the Mediterranean

1.1 Boxplots

1.2 Generalized Linear Model Build

1.3 Coefficient and AIC Interpretation

1.4 Prediction

2. Hemlock Trees in the Smoky Mountains

2.1 Model Build

2.2 Coefficient Interpretation

2.3 Coefficient Transformation

2.4 Model Interpretation