setwd("~/Desktop/University of Utah PhD /Course Work/Fall 2022 Semester/GEOG6000_Data Analysis/lab05")
Used for binary (0, 1) data.
irished = read.csv("../datafiles/irished.csv", header = TRUE)
##Change the numerical values to categorical data with labels
irished$sex = factor(irished$sex,
levels = c(1,2),
labels = c("Male", "Female"))
irished$lvcert = factor(irished$lvcert,
levels = c(0,1),
labels = c("Not Taken", "Taken"))
##Center the DVRT Score
mean(irished$DVRT) #For reference
## [1] 100.152
irished$DVRT.cen = irished$DVRT - mean(irished$DVRT)
##Boxplot to examine the centered DVRT score and the leaving certificate
boxplot(DVRT.cen ~ lvcert, data = irished)
##For a binomial distribution (0,1 data), the link function is a logit function which uses the natural log (see slides). This acts to make the relationship linear
irished.glm1 = glm(lvcert ~ DVRT.cen, data = irished,
family = binomial(link = "logit"))
summary(irished.glm1)
##
## Call:
## glm(formula = lvcert ~ DVRT.cen, family = binomial(link = "logit"),
## data = irished)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9077 -0.9810 -0.4543 1.0307 2.1552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.278422 0.099665 -2.794 0.00521 **
## DVRT.cen 0.064369 0.007576 8.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 686.86 on 499 degrees of freedom
## Residual deviance: 593.77 on 498 degrees of freedom
## AIC: 597.77
##
## Number of Fisher Scoring iterations: 3
##Converting the coefficients from log-odds to odds
exp(coef(irished.glm1))
## (Intercept) DVRT.cen
## 0.7569771 1.0664856
Coefficient Interpretation: Based on these coefficients, for a student with an average DVRT score, they have approximately a 0.76 odds of obtaining a leaving certificate. For an increase of one unit in DVRT score, the student’s odds will increase by 1.07. This change is multiplicative meaning if the student’s DVRT score changes by two units, the chances will increase by, \(Odds = (0.76*1.07)*1.07\), and so on.
NOTE: This output is in **odds* not probability.
Odds to Probability is \(P = Odds/(1+Odds)\). In this case, the probability is 0.43.
0.76/(1+0.76)
## [1] 0.4318182
##We need new data to run the model. The independent variable, DVRT, needs to be converted to a centered score in order for the model to use it as we used a centered DVRT score to construct the model.
newDVRT = data.frame(DVRT.cen = 120 - mean(irished$DVRT)) ##DVRT.cen becomes the column header
predict(irished.glm1,
newdata = newDVRT,
type = 'response',
se.fit = TRUE)
## $fit
## 1
## 0.730895
##
## $se.fit
## 1
## 0.03350827
##
## $residual.scale
## [1] 1
The probability is 0.73 that the student will obtain the certificate. The coefficients from the model are in odds. NOTE: This output gives you the probability due to the type = ‘response’ syntax. These are not equivalent.
Odds are the ratios of probabilities between an event happening versus the probability that it won’t. \(Odds in favor of A = A/(1-A)\)
newDVRT.2 = data.frame(DVRT.cen = seq(60,160) - mean(irished$DVRT))
lvcert.pred = predict(irished.glm1,
newdata = newDVRT.2,
type = 'response')
##Plotting the DVRT score sequence 60 - 160 with the probability of obtaining the leaving certificate. It should appear like this graph with asymptotes as they approach 0% and 100% probability
plot(newDVRT.2$DVRT.cen + mean(irished$DVRT),
lvcert.pred,
type = 'l',
col = 2,
lwd = 2,
xlab = 'DVRT',
ylab = 'Pr(lvcert)')
anova(irished.glm1, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: lvcert
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 499 686.86
## DVRT.cen 1 93.095 498 593.77 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
turbines = read.csv("../datafiles/turbines.csv")
head(turbines)
## Hours Turbines Fissures
## 1 400 39 0
## 2 1000 53 4
## 3 1400 33 2
## 4 1800 73 7
## 5 2200 30 5
## 6 2600 39 9
##Calculating the proportion of fissures to turbines
##The plot includes additional observations? Or is this a prediction?
turbines$prop = turbines$Fissures/turbines$Turbines
plot(prop ~ Hours, data = turbines)
turbine.glm = glm(prop ~ Hours,
data = turbines,
family = binomial(link = 'logit'),
weights = Turbines)
turbine.odds = exp(coef(turbine.glm))
turbine.odds
## (Intercept) Hours
## 0.01976986 1.00099974
Prediction
newturbine = data.frame(Hours = 5000)
predicted_probability = predict(turbine.glm,
newdata = newturbine,
type = 'response',
se.fit = TRUE)
predicted_probability
## $fit
## 1
## 0.7450891
##
## $se.fit
## 1
## 0.04753524
##
## $residual.scale
## [1] 1
Used for count data.
hsa = read.csv("../datafiles/hsa.csv", header = TRUE)
hsa$prog = factor(hsa$prog,
levels = c(1, 2, 3),
labels = c("General", "Academic", "Vocational"))
boxplot(math ~ num_awards, data = hsa)
boxplot(math ~ prog, data = hsa)
hsa.glm = glm(num_awards ~ math + prog,
family = poisson(link = 'log'),
data = hsa)
summary(hsa.glm)
##
## Call:
## glm(formula = num_awards ~ math + prog, family = poisson(link = "log"),
## data = hsa)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.8436 -0.5106 0.2558 2.6796
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
## math 0.07015 0.01060 6.619 3.63e-11 ***
## progAcademic 1.08386 0.35825 3.025 0.00248 **
## progVocational 0.36981 0.44107 0.838 0.40179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
hsa.glm.transformed.coef = exp(coef(hsa.glm))
hsa.glm.transformed.coef
## (Intercept) math progAcademic progVocational
## 0.00526263 1.07267164 2.95606545 1.44745846
Coefficient Interpretation
The intercept is the expected number of awards for a student with a zero math score in the general program. This is not particularly useful, however, we could re-run this analysis with a centered math score to produce a more meaningful intercept term.
The math coefficient, 1.07, is the award rate increase for a unit increase in math score. This is a positive relationship between math score and the number of awards.
The academic program coefficient, 2.96, is the award rate increase for students in this group.
The vocational program coefficient, 1.45, is the award rate increase for students in this group.
Note: These coefficients are multiplicative.
Prediction Model
##Create new data with a column for math and program - the same variables in the model.
hsa.new = data.frame(math = 70, prog = 'Academic')
predict(hsa.glm,
newdata = hsa.new,
type = 'response',
se.fit = TRUE)
## $fit
## 1
## 2.111508
##
## $se.fit
## 1
## 0.275062
##
## $residual.scale
## [1] 1
##Re-run the prediction for a student from the general program.
hsa.new2 = data.frame(math = 70, prog = 'General')
predict(hsa.glm,
newdata = hsa.new2,
type = 'response',
se.fit = TRUE)
## $fit
## 1
## 0.7142969
##
## $se.fit
## 1
## 0.2686186
##
## $residual.scale
## [1] 1
island = read.csv("../datafiles/island2.csv", header = TRUE)
##Note from Simon -
##Don’t confuse the reference for a covariate dummy variable with the 0/1 outcome of a binary variable. The coefficients in the model are the probably of getting a ‘success’, or a 1 outcome. So your intercept is the probability of 1 or presence of the bird
##Label the incidence as a factor and label it. The assumption is 0 means absent and 1 is present.
island$incidence = factor(island$incidence,
levels = c(0,1),
labels = c("Absent", "Present"))
##Centered Data
island$area.cen = island$area - (mean(island$area))
island$isolation.cen = island$isolation - (mean(island$isolation))
par(mfrow = c(1, 3)) ##Creating a multiple plot area with 1 row and 3 columns
boxplot(area.cen ~ incidence, data = island,
ylab = "Area",
xlab = "Incidence",
main = "Incidence by Centered Area",
col = c("darksalmon", "darkseagreen"))
boxplot(isolation.cen ~ incidence, data = island,
ylab = "Isolation",
xlab = "Incidence",
main = "Incidence by Centered Isolation",
col = c("darksalmon", "darkseagreen"))
boxplot(quality ~ incidence, data = island,
ylab = "Quality",
xlab = "Incidence",
main = "Incidence by Quality",
col = c("darksalmon", "darkseagreen"))
##Initial uncentered model
bird.pres.uncent.glm = glm(incidence ~ area + isolation,
data = island,
family = binomial(link = "logit"))
summary(bird.pres.uncent.glm)
##
## Call:
## glm(formula = incidence ~ area + isolation, family = binomial(link = "logit"),
## data = island)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8189 -0.3089 0.0490 0.3635 2.1192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.6417 2.9218 2.273 0.02302 *
## area 0.5807 0.2478 2.344 0.01909 *
## isolation -1.3719 0.4769 -2.877 0.00401 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.029 on 49 degrees of freedom
## Residual deviance: 28.402 on 47 degrees of freedom
## AIC: 34.402
##
## Number of Fisher Scoring iterations: 6
##Calculating the log odds to odds for the uncentered model
incidence.uncen.odds = exp(coef(bird.pres.uncent.glm))
incidence.uncen.odds
## (Intercept) area isolation
## 766.3669575 1.7873322 0.2536142
##Calculating the uncentered model probability
Intercept.uncent.prob = 766.3669575 /(1+766.3669575 )
Intercept.uncent.prob
## [1] 0.9986968
##Centered Model
bird.pres.glm = glm(incidence ~ area.cen + isolation.cen,
data = island,
family = binomial(link = "logit"))
summary(bird.pres.glm)
##
## Call:
## glm(formula = incidence ~ area.cen + isolation.cen, family = binomial(link = "logit"),
## data = island)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8189 -0.3089 0.0490 0.3635 2.1192
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1154 0.5877 1.898 0.05770 .
## area.cen 0.5807 0.2478 2.344 0.01909 *
## isolation.cen -1.3719 0.4769 -2.877 0.00401 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.029 on 49 degrees of freedom
## Residual deviance: 28.402 on 47 degrees of freedom
## AIC: 34.402
##
## Number of Fisher Scoring iterations: 6
##Convert the coefficients from log odds to odds
incidence.odds = exp(coef(bird.pres.glm))
incidence.odds
## (Intercept) area.cen isolation.cen
## 3.0507967 1.7873322 0.2536142
##To convert from odds to probability the equation is P = Odds/(1+Odds)
Intercept.prob = 3.0507967/(1+3.0507967)
Intercept.prob
## [1] 0.753135
The AIC is 34.4 which can be used to compare this model against other models to determine which model is best for describing this data set.
The intercept for the uncentered model is 766.37 which, is a probability of almost 1 (0.998). This means for that for an island with an area of zero and isolation of zero, there is almost statistical certainty that there will be bids present. This is not necessarily a useful interpretation for the intercept. After re-constructing the model with centered data, the intercept is 3.05 which means there is a probability of 0.75 that there is a presence of birds on an island with an average area and isolation.
The area coefficient is 1.79 which is a multiplicative coefficient that as the area increases by one unit, the odds of presence increase by a rate of 1.79.
The isolation coefficient is 0.25 which is also a multiplicative coefficient that as isolation increases by one unit, the odds of presence decrease by a rate of 0.25. Since this value is less than 1, it indicates that it negatively affects the presence of birds.
##Create a new data frame
##Use the uncentered model as the new data is given in the original scale.
bird.data.new = data.frame(area = 5, isolation = 6)
bird.predict = predict.glm(bird.pres.uncent.glm,
newdata = bird.data.new,
type = 'response',
se.fit = TRUE)
bird.predict
## $fit
## 1
## 0.7881208
##
## $se.fit
## 1
## 0.1125028
##
## $residual.scale
## [1] 1
tsuga = read.csv("../datafiles/tsuga.csv", header = TRUE)
tsuga.glm = glm(cover ~ streamdist + elev,
data = tsuga,
family = poisson(link = 'log'))
summary(tsuga.glm)
##
## Call:
## glm(formula = cover ~ streamdist + elev, family = poisson(link = "log"),
## data = tsuga)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.31395 -0.82155 -0.07929 0.71900 2.62316
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.622e+00 5.226e-02 31.047 < 2e-16 ***
## streamdist -8.963e-04 1.173e-04 -7.641 2.15e-14 ***
## elev 8.901e-05 5.653e-05 1.575 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 748.23 on 744 degrees of freedom
## Residual deviance: 687.10 on 742 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 3150.2
##
## Number of Fisher Scoring iterations: 4
tsuga.glm.coef = exp(coef(tsuga.glm))
tsuga.glm.coef
## (Intercept) streamdist elev
## 5.0652901 0.9991041 1.0000890