Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer.
Check the appropriateness of response variable for LOGISTIC regression: We want to look at the breakdown of successes and failures. It is useful if there is a relatively even number of successes and failures in the data as that will allow for more accurate estimates of the coefficients. (Which lead to more accurate predictions)
\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.
birdkeeping<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/C7%20BirdKeeping.csv",header=T)
head(birdkeeping)
names(birdkeeping)
[1] "Gender" "Status" "Age" "Smoked" "Cigarettes"
[6] "Bird" "LungCancer"
table(birdkeeping$LungCancer)
0 1
98 49
Although this break down is not even, there is a suitable number of observations in each group
We will explore the relationship with qualitative variables with mosaic plots and proportion and classify each relationship as existent or not. We explore the box plots and means for each quantitative variable explanatory variable then classify the relationships as existent or not.
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)
prop.table(table(birdkeeping$LungCancer,birdkeeping$Bird),margin=2)
0 1
0 0.8000000 0.5074627
1 0.2000000 0.4925373
#Side by side boxplots and summary stats for quantitative
boxplot(Smoked~LungCancer, data=birdkeeping)
tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$`0`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 16.00 25.00 24.96 38.75 47.00
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 27.00 36.00 33.63 41.00 50.00
Here we will fit the model with all of the predictors.
NOTE: It is good practice to define what you want to be treated as the base level Y=0. You can do this with the relevel function. The relevel function can be used even if the original column is a string.
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"0")
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)
Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.93736 1.80425 -1.074 0.282924
Gender 0.56127 0.53116 1.057 0.290653
Status 0.10545 0.46885 0.225 0.822050
Age -0.03976 0.03548 -1.120 0.262503
Smoked 0.07287 0.02649 2.751 0.005940 **
Cigarettes 0.02602 0.02552 1.019 0.308055
Bird 1.36259 0.41128 3.313 0.000923 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 154.20 on 140 degrees of freedom
AIC: 168.2
Number of Fisher Scoring iterations: 5
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)
\(\hat{\pi^*}=-1.937+0.56 Gender+0.105 Status-0.039 Age+0.073 Smoked+0.026Cigarettes+1.36Bird\)
# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept) Gender Status Age Smoked Cigarettes
-1.93735861 0.56127270 0.10544761 -0.03975542 0.07286848 0.02601689
Bird
1.36259456
confint(birdmod1)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -5.59554204 1.54015158
Gender -0.48194383 1.61625027
Status -0.82695901 1.02585600
Age -0.11263170 0.02827386
Smoked 0.02529552 0.13082279
Cigarettes -0.02407110 0.07680737
Bird 0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept) Gender Status Age Smoked Cigarettes
0.1440840 1.7529020 1.1112079 0.9610245 1.0755891 1.0263583
Bird
3.9063153
exp(confint(birdmod1))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.003714385 4.665297
Gender 0.617581748 5.034178
Status 0.437377326 2.789482
Age 0.893479671 1.028677
Smoked 1.025618164 1.139766
Cigarettes 0.976216301 1.079834
Bird 1.773756897 8.961720
Quantitative Interpretations:
Qualitative Interpretations:
#Compute AIC and BIC
AIC(birdmod1)
[1] 168.1984
BIC(birdmod1)
[1] 189.1314
#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod1)
The Hosmer-Lemeshow goodness-of-fit test
Group Size Observed Expected
1 15 0 0.4319818
2 15 2 1.3729248
3 15 4 2.2192168
4 15 3 3.3585724
5 15 4 4.1808653
6 15 3 5.0285541
7 15 6 6.6485045
8 15 8 8.3305082
9 15 11 9.1584956
10 12 8 8.2703765
Statistic = 4.85068
degrees of freedom = 8
p-value = 0.77341
#Metrics of Association
#requires blorr package
blr_pairs(birdmod1)
We will evaluate the logistic regression model with AIC, BIC, Hosmer-Lemeshow Goodness of Fit Test, and metrics of association. Individually these metrics are not useful. But these are more typically used to compare between two or more models to determine which is best.
\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
Test the Significance of Bird (one predictor- Wald Test)
Test the Significance of smoking related variables (two predictors- drop in deviance test)
#Fit the reduced model
birdmod2<-glm(LungCancer~.-Smoked-Cigarettes,data=birdkeeping,family=binomial())
summary(birdmod2)
Call:
glm(formula = LungCancer ~ . - Smoked - Cigarettes, family = binomial(),
data = birdkeeping)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.01412 1.59950 -1.259 0.207951
Gender -0.28884 0.45248 -0.638 0.523245
Status -0.24848 0.42613 -0.583 0.559822
Age 0.01313 0.02653 0.495 0.620557
Bird 1.40133 0.38831 3.609 0.000308 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 187.14 on 146 degrees of freedom
Residual deviance: 171.87 on 142 degrees of freedom
AIC: 181.87
Number of Fisher Scoring iterations: 4
#Perform drop in deviance test
anova(birdmod2, birdmod1, test = "Chisq")
That will conclude our variable testing for this example.
blr_plot_difchisq_leverage(birdmod1)
blr_plot_diag_influence(birdmod1)
blr_residual_diagnostics(birdmod1)
plot(birdmod1,which=5)
We are not concerned about the trends in these plots for identifying any outliers.
#compute pi-hat values. These are probabilities of success for each value.
birdkeeping$fits<-fitted.values(birdmod1)
head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod1)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 80 22
1 18 27
Accuracy : 0.7279
No Information Rate : 0.6667
Kappa : 0.3750
McNemars's Test P-Value : 0.6353
Sensitivity : 0.5510
Specificity : 0.8163
Pos Pred Value : 0.6000
Neg Pred Value : 0.7843
Prevalence : 0.3333
Detection Rate : 0.1837
Detection Prevalence : 0.3061
Balanced Accuracy : 0.6837
Precision : 0.6000
Recall : 0.5510
'Positive' Class : 1
We can compare how well our model does at predicting within the data set. For example, observation 1 was a “success”, but had a predicted probability of 0.41.
Final note: We can classify “correct” and “incorrect” conclusions. the default cutoff is typically 0.5. - sensitivity is the ability of a test to correctly identify a success (true positive rate) - specificity is the ability of the test to correctly identify a failure (true negative rate) - accuracy overall the percentage of correctly identified results - McNemars’s Test has a null that there is no difference between the proportion of false positives and false negatives (good thing). A pvalue greater than alpha our model doesn’t over correct in either direction.