Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer.
Check the appropriateness of the response variable for LOGISTIC regression: We want to look at the breakdown of successes and failures. It is useful if there is a relatively even number of successes and failures in the data, as that allows for more accurate estimates of the coefficients (which in turn lead to more accurate predictions).
\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.
birdkeeping<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/C7%20BirdKeeping.csv",header=T)
head(birdkeeping)
names(birdkeeping)
[1] "Gender" "Status" "Age" "Smoked" "Cigarettes"
[6] "Bird" "LungCancer"
# table(response) for counts
table(birdkeeping$LungCancer)
0 1
98 49
Although this breakdown is not even, there is a suitable number of observations in each group.
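The same breakdown can be expressed as proportions (an optional extra check, using base R's prop.table):
# proportion of failures (0) and successes (1) in the response
prop.table(table(birdkeeping$LungCancer))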
We will explore the relationship between the response and each qualitative explanatory variable with mosaic plots and conditional proportions, and classify each relationship as existent or not. For each quantitative explanatory variable, we explore side-by-side box plots and summary statistics, then classify the relationship as existent or not.
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)
table(birdkeeping$Bird,birdkeeping$LungCancer)
0 1
0 64 16
1 34 33
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)
0 1
0 0.6530612 0.3265306
1 0.3469388 0.6734694
#Side by side boxplots and summary stats for quantitative
boxplot(Smoked~LungCancer, data=birdkeeping)
tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 16.00 25.00 24.96 38.75 47.00
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 27.00 36.00 33.63 41.00 50.00
Here we will fit the model with all of the predictors.
NOTE: It is good practice to define which level you want treated as the base level, or “failure” (Y=0). You can do this with the relevel function, which works even if the original column is a string.
# Be sure to clearly define the base level for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"0")
# We want to ensure the explanatory variables are treated as factor
# levels, instead of numeric 1 and 0
birdkeeping$Gender<-as.factor(birdkeeping$Gender)
birdkeeping$Status<-as.factor(birdkeeping$Status)
birdkeeping$Bird<-as.factor(birdkeeping$Bird)
# you will know a variable is treated as a factor if it outputs as such in the
# model with the level names (or you can check the type in the environment)
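Since Bird is already coded 0/1 in this data set, here is a small stand-alone sketch (with a made-up character vector, bird_str) showing that relevel works the same way on a string column:
# Hypothetical illustration: relevel on a character column
bird_str<-c("Yes","No","No","Yes")
bird_fac<-relevel(factor(bird_str),"No") # "No" becomes the base level
levels(bird_fac)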
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)
Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.93736 1.80425 -1.074 0.282924
Gender1 0.56127 0.53116 1.057 0.290653
Status1 0.10545 0.46885 0.225 0.822050
Age -0.03976 0.03548 -1.120 0.262503
Smoked 0.07287 0.02649 2.751 0.005940 **
Cigarettes 0.02602 0.02552 1.019 0.308055
Bird1 1.36259 0.41128 3.313 0.000923 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 154.20 on 140 degrees of freedom
AIC: 168.2
Number of Fisher Scoring iterations: 5
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5 Cigarettes+\beta_6 Bird\)
\(\hat{\pi}^*=-1.937+0.561 Gender+0.105 Status-0.040 Age+0.073 Smoked+0.026 Cigarettes+1.363 Bird\)
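To see the fitted equation in action, we can plug in a hypothetical profile (made up here for illustration) and let predict() return the estimated log-odds and probability:
# Hypothetical profile: Gender=0, Status=0, age 50, smoked 30 years,
# 10 cigarettes per day, keeps a bird
newobs<-data.frame(Gender=factor("0",levels=c("0","1")),
                   Status=factor("0",levels=c("0","1")),
                   Age=50,Smoked=30,Cigarettes=10,
                   Bird=factor("1",levels=c("0","1")))
predict(birdmod1,newdata=newobs,type="link")     # estimated log-odds, pi-hat-star
predict(birdmod1,newdata=newobs,type="response") # estimated probability, pi-hat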
# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept) Gender1 Status1 Age Smoked Cigarettes
-1.93735861 0.56127270 0.10544761 -0.03975542 0.07286848 0.02601689
Bird1
1.36259456
confint(birdmod1)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -5.59554204 1.54015158
Gender1 -0.48194383 1.61625027
Status1 -0.82695901 1.02585600
Age -0.11263170 0.02827386
Smoked 0.02529552 0.13082279
Cigarettes -0.02407110 0.07680737
Bird1 0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept) Gender1 Status1 Age Smoked Cigarettes
0.1440840 1.7529020 1.1112079 0.9610245 1.0755891 1.0263583
Bird1
3.9063153
exp(confint(birdmod1))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.003714385 4.665297
Gender1 0.617581748 5.034178
Status1 0.437377326 2.789482
Age 0.893479671 1.028677
Smoked 1.025618164 1.139766
Cigarettes 0.976216301 1.079834
Bird1 1.773756897 8.961720
Quantitative Interpretations: Holding all other variables constant, for each additional year of smoking the estimated odds of lung cancer are multiplied by \(e^{0.0729}\approx 1.076\); that is, the odds increase by about 7.6% per year smoked.
Qualitative Interpretations: Holding all other variables constant, the estimated odds of lung cancer for bird keepers are \(e^{1.363}\approx 3.91\) times the odds for those who do not keep birds.
#Compute AIC and BIC
AIC(birdmod1)
[1] 168.1984
BIC(birdmod1)
[1] 189.1314
#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod1)
The Hosmer-Lemeshow goodness-of-fit test
Group Size Observed Expected
1 15 0 0.4319818
2 15 2 1.3729248
3 15 4 2.2192168
4 15 3 3.3585724
5 15 4 4.1808653
6 15 3 5.0285541
7 15 6 6.6485045
8 15 8 8.3305082
9 15 11 9.1584956
10 12 8 8.2703765
Statistic = 4.85068
degrees of freedom = 8
p-value = 0.77341
#Metrics of Association
#requires blorr package
blr_pairs(birdmod1)
We evaluate the logistic regression model with AIC, BIC, the Hosmer-Lemeshow goodness-of-fit test, and metrics of association. Interpreting these metrics individually is not practically useful; they are more typically used to compare two or more models to determine which is best. (One exception: the large Hosmer-Lemeshow p-value of 0.773 above gives no evidence of lack of fit.)
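As a sketch of such a comparison, we could fit a reduced model (birdmod2 is a hypothetical model introduced here for illustration, not part of the original analysis) and see which model has the lower AIC and BIC:
# Hypothetical reduced model keeping only the significant predictors
birdmod2<-glm(LungCancer~Bird+Smoked,data=birdkeeping,family=binomial())
# For AIC and BIC, lower values indicate the preferred model
AIC(birdmod1,birdmod2)
BIC(birdmod1,birdmod2)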
We test the overall utility of the model \(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5 Cigarettes+\beta_6 Bird\) with the likelihood ratio test: \(H_0:\beta_1=\beta_2=\cdots=\beta_6=0\) versus \(H_a:\) at least one \(\beta_i\neq 0\). The test statistic is the null deviance minus the residual deviance, which under \(H_0\) follows a chi-squared distribution with \(k=6\) degrees of freedom.
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
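Equivalently, as a cross-check using base R rather than the manual computation, we can compare the fitted model against an intercept-only model (nullmod is a name introduced here):
# Intercept-only (null) model for the likelihood ratio comparison
nullmod<-glm(LungCancer~1,data=birdkeeping,family=binomial())
# Reproduces the same statistic (32.94) and p-value as above
anova(nullmod,birdmod1,test="Chisq")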
blr_plot_difchisq_leverage(birdmod1)
blr_plot_diag_influence(birdmod1)
blr_residual_diagnostics(birdmod1)
plot(birdmod1,which=5)
When using these plots to identify outliers, we are not concerned with overall trends; we are looking for observations that are “far” from the others.
#compute pi-hat values: the predicted probability of success for each observation
birdkeeping$fits<-fitted.values(birdmod1)
head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod1)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 80 22
1 18 27
Accuracy : 0.7279
No Information Rate : 0.6667
Kappa : 0.3750
McNemar's Test P-Value : 0.6353
Sensitivity : 0.5510
Specificity : 0.8163
Pos Pred Value : 0.6000
Neg Pred Value : 0.7843
Prevalence : 0.3333
Detection Rate : 0.1837
Detection Prevalence : 0.3061
Balanced Accuracy : 0.6837
Precision : 0.6000
Recall : 0.5510
'Positive' Class : 1
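These metrics come directly from the table above; for example, sensitivity and specificity can be verified by hand:
# sensitivity = true positives / all actual positives (Reference = 1)
27/49 # 0.5510, matching Sensitivity above
# specificity = true negatives / all actual negatives (Reference = 0)
80/98 # 0.8163, matching Specificity above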
We can compare how well our model does at predicting within the data set. For example, observation 1 was actually a “success”, but had a predicted probability of 0.41.
Final note: We can classify predictions as “correct” or “incorrect”. The default cutoff is typically 0.5, meaning if \(\hat{\pi}\) is greater than 0.5 the observation is labeled a “success”. You can change this threshold depending on the context.
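A minimal sketch of changing the threshold, using the fitted probabilities saved earlier in birdkeeping$fits (the 0.4 cutoff is an arbitrary choice for illustration):
# Classify each observation as a success when pi-hat exceeds the cutoff
cutoff<-0.4
birdkeeping$pred<-ifelse(birdkeeping$fits>cutoff,1,0)
# Cross-tabulate predicted vs actual classes
table(Predicted=birdkeeping$pred,Actual=birdkeeping$LungCancer)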