Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer. Read more about this study here:
\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.
The qualitative variables were originally stored as dummy variables (1 and 0). We are going to rename them as factors and qualitative levels in the data table so that they are easier to interpret in our analysis and output.
# table(response) for counts
head(birdkeeping)
table(birdkeeping$LungCancer)
No Yes
98 49
barplot(table(birdkeeping$LungCancer),xlab="Lung Cancer", main="Counts of Response")
Although this break down is not even, there is a suitable number of observations in each group. We have 49 observations in the lower case group, so we would likely not want to include more than about 5 predictors in our model.
We explore the relationships with quantitative variables with box plots and means then classify the relationships as existent or not.
# Age
boxplot(Age~LungCancer, data=birdkeeping)
tapply(birdkeeping$Age,birdkeeping$LungCancer,summary)
$No
Min. 1st Qu. Median Mean 3rd Qu. Max.
37 52 59 57 63 67
$Yes
Min. 1st Qu. Median Mean 3rd Qu. Max.
37.0 52.0 58.0 56.9 63.0 66.0
# Smoked
boxplot(Smoked~LungCancer, data=birdkeeping)
tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$No
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 16.00 25.00 24.96 38.75 47.00
$Yes
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 27.00 36.00 33.63 41.00 50.00
# Cigarettes
boxplot(Cigarettes~LungCancer, data=birdkeeping)
tapply(birdkeeping$Cigarettes,birdkeeping$LungCancer,summary)
$No
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 6.00 15.00 14.21 20.00 45.00
$Yes
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 15.00 20.00 18.82 20.00 40.00
We will explore the relationship with qualitative variables with mosaic plots and conditional proportions and classify each relationship as existent or not.
# Gender
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Gender,data=birdkeeping)
#contingency table for counts (table(rows, columns))
table(birdkeeping$Gender,birdkeeping$LungCancer)
No Yes
M 74 37
F 24 12
#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Gender,birdkeeping$LungCancer),margin=2)
No Yes
M 0.755102 0.755102
F 0.244898 0.244898
# Status
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Status,data=birdkeeping)
#contingency table for counts (table(rows, columns))
table(birdkeeping$Status,birdkeeping$LungCancer)
No Yes
Low 65 37
High 33 12
#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Status,birdkeeping$LungCancer),margin=2)
No Yes
Low 0.6632653 0.7551020
High 0.3367347 0.2448980
# Bird
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)
#contingency table for counts (table(rows, columns))
table(birdkeeping$Bird,birdkeeping$LungCancer)
No Yes
NoBird 64 16
OwnBird 34 33
#Convert contingency table to proportions, margin=2 specifices totalling over the columns
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)
No Yes
NoBird 0.6530612 0.3265306
OwnBird 0.3469388 0.6734694
We do not see an association between lung cancer and gender. The proportion of males (or the proportion of females) is exactly the same for both those who have and not have lung cancer.
We do not see an association between lung cancer and status. The proportion of high (or the proportion of low) is about the same for both those who have and not have lung cancer.
We see there might be an association with whether a person owned a bird and whether they have lung cancer. We that for those without lung cancer, 35 percent owned a bird. Compared to the 67 percent of those who had lung cancer owning a bird.
We will not explore interactions in this example.
We can also evaluate multicollinearity between the predictors.
We can also use variable screening in logistic regression.
Here we will fit the model with all of the predictors.
NOTE: It is good practice to define what you want to be treated as the base level, or “failure” Y=0. You can do this with the relevel function. The relevel function can be used even if the original column is numeric.
# Be sure to clearly define the base level (failure group) for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"No")
# We can use the . to abbreviate the syntax and fit a model with all
# predictors in the data table.
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)
Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.93736 1.80425 -1.074 0.282924
GenderF 0.56127 0.53116 1.057 0.290653
StatusHigh 0.10545 0.46885 0.225 0.822050
Age -0.03976 0.03548 -1.120 0.262503
Smoked 0.07287 0.02649 2.751 0.005940 **
Cigarettes 0.02602 0.02552 1.019 0.308055
BirdOwnBird 1.36259 0.41128 3.313 0.000923 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 154.20 on 140 degrees of freedom
AIC: 168.2
Number of Fisher Scoring iterations: 5
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 GenderF+\beta_2 StatusHigh+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6BirdOwnBird\)
\(\hat{\pi^*}=-1.937+0.56 GenderF+0.105 StatusHigh-0.039 Age+0.073 Smoked+0.026Cigarettes+1.36BirdOwnBird\)
Where:
Show that the beta coefficients do not change with whatever form of the model you write.
Consider the model \(\pi^*=ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1x\)
Solve the equation for \(\pi\) to show \(\pi=\dfrac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\). Show all of your steps. Use the appropriate markdown syntax (it may help to to write it out on paper before writing into the markdown file.)
Show a one unit change in x results in a \(e^\beta\) factor increase in the odds for quantitative variables.
First, consider the model for the log-odds: \(\pi^*=\beta_0+\beta_1x\). Compute \(\pi^*\) for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the log-odds change for a one unit change in x.
Now, consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1x}\). Compute the odds for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the odds change for a one unit change in x.
Show the interpretation for a dummy variable coefficient.
Assume we have a qualitative variable with 3 levels (A, B, C). Let DummyA = 1 if group A and 0 otherwise and DummyB=1 if group B, 0 otherwise. Group C is the base level.
Consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1DummyA +\beta_2DummyB}\). Compute the odds group A, B, and C in terms of the betas. Then determine the interpretation of a single beta coefficient associated with a dummy variable.
# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept) GenderF StatusHigh Age Smoked Cigarettes
-1.93735861 0.56127270 0.10544761 -0.03975542 0.07286848 0.02601689
BirdOwnBird
1.36259456
confint(birdmod1, level=.95)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -5.59554204 1.54015158
GenderF -0.48194383 1.61625027
StatusHigh -0.82695901 1.02585600
Age -0.11263170 0.02827386
Smoked 0.02529552 0.13082279
Cigarettes -0.02407110 0.07680737
BirdOwnBird 0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept) GenderF StatusHigh Age Smoked Cigarettes
0.1440840 1.7529020 1.1112079 0.9610245 1.0755891 1.0263583
BirdOwnBird
3.9063153
exp(confint(birdmod1,level=.95))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.003714385 4.665297
GenderF 0.617581748 5.034178
StatusHigh 0.437377326 2.789482
Age 0.893479671 1.028677
Smoked 1.025618164 1.139766
Cigarettes 0.976216301 1.079834
BirdOwnBird 1.773756897 8.961720
Quantitative Interpretations:
On Your Own: Try the interpretations for the “Smoked” variable.
Qualitative Interpretations:
On Your Own: Try the interpretations for the “Status” variable.
For logistic regression we use the hypothesis format of \(H_0: \beta_i=0\). That translates to \(H_0: e^{\beta_i}=1\). Therefore, if and only if the interval of the odds does not contain 1, the interval for the log-odds will not contain 0, and the parameter is significant.
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)
#REPRINT MODEL
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)
Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.93736 1.80425 -1.074 0.282924
GenderF 0.56127 0.53116 1.057 0.290653
StatusHigh 0.10545 0.46885 0.225 0.822050
Age -0.03976 0.03548 -1.120 0.262503
Smoked 0.07287 0.02649 2.751 0.005940 **
Cigarettes 0.02602 0.02552 1.019 0.308055
BirdOwnBird 1.36259 0.41128 3.313 0.000923 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 154.20 on 140 degrees of freedom
AIC: 168.2
Number of Fisher Scoring iterations: 5
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
birdmod2<-glm(LungCancer~.-Age,data=birdkeeping,family=binomial())
summary(birdmod2)
Call:
glm(formula = LungCancer ~ . - Age, family = binomial(), data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.80143 0.82243 -4.622 3.8e-06 ***
GenderF 0.58361 0.52725 1.107 0.268340
StatusHigh 0.03958 0.46451 0.085 0.932102
Smoked 0.05677 0.02033 2.793 0.005229 **
Cigarettes 0.03107 0.02476 1.255 0.209555
BirdOwnBird 1.43560 0.40655 3.531 0.000414 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 155.49 on 141 degrees of freedom
AIC: 167.49
Number of Fisher Scoring iterations: 5
#Compute AIC and BIC
AIC(birdmod2)
[1] 167.4945
BIC(birdmod2)
[1] 185.4371
#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod2)
The Hosmer-Lemeshow goodness-of-fit test
Group Size Observed Expected
1 15 0 0.6212571
2 15 3 1.5580862
3 15 2 2.2771044
4 15 2 3.2741449
5 15 6 4.1286036
6 15 5 4.9967048
7 15 5 6.2345576
8 15 8 8.2309190
9 15 9 9.2164032
10 12 9 8.4622194
Statistic = 4.54343
degrees of freedom = 8
p-value = 0.80507
We will evaluate the logistic regression model with AIC, BIC, and Hosmer-Lemeshow Goodness of Fit Test. Individually interpreting these metrics are not practically useful. But these are more typically used to compare between two or more models to determine which is best.
We will not cover this specifically in this course, but the supplemental material gives you a bit of a background.
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)
#compute pi-hat values. These are probabilities of success for each value.
birdkeeping$fits<-fitted.values(birdmod2)
head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod2)
Confusion Matrix and Statistics
Reference
Prediction No Yes
0 81 23
1 17 26
Accuracy : 0.7279
No Information Rate : 0.6667
Kappa : 0.3684
McNemars's Test P-Value : 0.4292
Sensitivity : 0.5306
Specificity : 0.8265
Pos Pred Value : 0.6047
Neg Pred Value : 0.7788
Prevalence : 0.3333
Detection Rate : 0.1769
Detection Prevalence : 0.2925
Balanced Accuracy : 0.6786
Precision : 0.6047
Recall : 0.5306
'Positive' Class : 1
Final note: We can classify “correct” and “incorrect” conclusions. The default cutoff is typically 0.5. Meaning is \(\hat{\pi}\) if greater than 0.5 it will be labeled as a “success”. You can change this threshold depending on the context.