Holst, Kromhout, and Brand conducted a case-control study to determine whether keeping birds increases the risk of lung cancer. The idea is that people who keep birds may inhale more allergens and dust particles and thus may be more likely to develop lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer.

Step 1: Collect the Data

Check that the response variable is appropriate for LOGISTIC regression: we want to look at the breakdown of successes and failures. A relatively even split between successes and failures allows for more accurate estimates of the coefficients (which in turn lead to more accurate predictions).

Defining the response

\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.

birdkeeping<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/C7%20BirdKeeping.csv",header=T)
head(birdkeeping)
names(birdkeeping)
[1] "Gender"     "Status"     "Age"        "Smoked"     "Cigarettes"
[6] "Bird"       "LungCancer"
# table(response) for counts 
table(birdkeeping$LungCancer)

 0  1 
98 49 

Although this breakdown is not even, there is a suitable number of observations in each group.
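A quick way to see the split as proportions rather than counts (here about one-third successes) is a sketch with prop.table():

# Proportion of failures (0) and successes (1) in the response
prop.table(table(birdkeeping$LungCancer))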

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We will explore the relationship between the response and each qualitative explanatory variable with mosaic plots and conditional proportions, and classify each relationship as existent or not. For each quantitative explanatory variable we explore side-by-side box plots and group summary statistics, then classify the relationship as existent or not.

#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Bird,birdkeeping$LungCancer)
   
     0  1
  0 64 16
  1 34 33
#Convert contingency table to proportions; margin=2 specifies proportions within each column (columns sum to 1)
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)
   
            0         1
  0 0.6530612 0.3265306
  1 0.3469388 0.6734694
#Side-by-side boxplots and summary stats for quantitative variables
boxplot(Smoked~LungCancer, data=birdkeeping)

tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   24.96   38.75   47.00 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   27.00   36.00   33.63   41.00   50.00 
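The same approach extends to the remaining predictors. For example, a sketch of the same checks for Gender (qualitative) and Age (quantitative):

# Qualitative: mosaic plot and conditional proportions for Gender
mosaicplot(LungCancer~Gender,data=birdkeeping)
prop.table(table(birdkeeping$Gender,birdkeeping$LungCancer),margin=2)

# Quantitative: side-by-side boxplots and summary stats for Age
boxplot(Age~LungCancer, data=birdkeeping)
tapply(birdkeeping$Age,birdkeeping$LungCancer,summary)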

Step 3: Estimate the model parameters

Here we will fit the model with all of the predictors.

NOTE: It is good practice to define which level you want treated as the base level, or “failure” (Y=0). You can do this with the relevel function, which works even if the original column is a character string.
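For instance, a small hypothetical sketch with a response stored as strings rather than 0/1 (this column is not in the bird-keeping data; it is made up for illustration):

# Hypothetical character response (not a column in this data set)
ychar<-c("Case","Control","Control","Case","Control")
# By default the first level alphabetically ("Case") would be the base;
# relevel() makes "Control" the base ("failure") level instead
y<-relevel(factor(ychar),ref="Control")
levels(y)  # "Control" is now first, so glm() would model P(Y="Case")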

# Be sure to clearly define the base level for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"0")

# We want to ensure the explanatory variables are treated as factors
# (with levels), instead of numeric 0 and 1
# We only do this if a variable meant to be qualitative is read in as numeric
birdkeeping$Gender<-as.factor(birdkeeping$Gender)
birdkeeping$Status<-as.factor(birdkeeping$Status)
birdkeeping$Bird<-as.factor(birdkeeping$Bird)

# you will know a variable is treated as a factor if it outputs as such in the
# model with the level names (or you can check the type in the environment)
# we will specify family=binomial()
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
Gender1      0.56127    0.53116   1.057 0.290653    
Status1      0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
Bird1        1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

MODEL:

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

PREDICTION EQUATION:

\(\hat{\pi^*}=-1.937+0.561 Gender+0.105 Status-0.040 Age+0.073 Smoked+0.026 Cigarettes+1.363 Bird\)

Where:

  • Gender= 1 if female, 0 otherwise
  • Status= 1 if high economic status, 0 otherwise
  • Bird= 1 if own a bird, 0 otherwise
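To turn a predicted log-odds into a probability we apply the inverse logit, \(\hat{\pi}=\frac{e^{\hat{\pi}^*}}{1+e^{\hat{\pi}^*}}\). A sketch for a hypothetical individual (the predictor values below are made up for illustration):

# Hypothetical individual: female, high status, age 50, smoked 20 years,
# 15 cigarettes per day, owns a bird
newperson<-data.frame(Gender=factor(1,levels=c(0,1)),Status=factor(1,levels=c(0,1)),
                      Age=50,Smoked=20,Cigarettes=15,Bird=factor(1,levels=c(0,1)))
predict(birdmod1,newdata=newperson,type="link")      # estimated log-odds (pi-star-hat)
predict(birdmod1,newdata=newperson,type="response")  # estimated probability (pi-hat)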

INTERPRETATIONS

# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept)     Gender1     Status1         Age      Smoked  Cigarettes 
-1.93735861  0.56127270  0.10544761 -0.03975542  0.07286848  0.02601689 
      Bird1 
 1.36259456 
confint(birdmod1)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept) -5.59554204 1.54015158
Gender1     -0.48194383 1.61625027
Status1     -0.82695901 1.02585600
Age         -0.11263170 0.02827386
Smoked       0.02529552 0.13082279
Cigarettes  -0.02407110 0.07680737
Bird1        0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept)     Gender1     Status1         Age      Smoked  Cigarettes 
  0.1440840   1.7529020   1.1112079   0.9610245   1.0755891   1.0263583 
      Bird1 
  3.9063153 
exp(confint(birdmod1))
Waiting for profiling to be done...
                  2.5 %   97.5 %
(Intercept) 0.003714385 4.665297
Gender1     0.617581748 5.034178
Status1     0.437377326 2.789482
Age         0.893479671 1.028677
Smoked      1.025618164 1.139766
Cigarettes  0.976216301 1.079834
Bird1       1.773756897 8.961720

Quantitative Interpretations:

  • For a one-year increase in age, the estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer decreases by 0.040, given the other variables are held constant.
  • For a one-year increase in age, the estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer change by a factor of 0.96 (about a 4% decrease), given the other variables are held constant.
  • We are 95% confident that for a one-year increase in age, the odds of a person having lung cancer change by a factor between 0.893 and 1.029, given the other variables are held constant. (Since 1 is in this interval, the variable “Age” is not a significant predictor.)
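The same idea handles multi-unit changes: multiply the coefficient (or its interval endpoints) by the change before exponentiating. A sketch for a 10-year age difference:

# Estimated odds ratio (and 95% CI) for a 10-year increase in age
exp(10*coefficients(birdmod1)["Age"])
exp(10*confint(birdmod1)["Age",])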

Qualitative Interpretations:

  • The estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer is 1.363 higher for those who own a bird than for those who do not, given the other variables are held constant.
  • The estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer are 3.9 times as high for those who own a bird as for those who do not, given the other variables are held constant.
  • We are 95% confident that the odds of a person having lung cancer are higher for bird owners by a factor between 1.77 and 8.96, given the other variables are held constant. (Since 1 is not in this interval, “Bird” is a significant predictor.)

Step 4: Assess the model (Unit 4.2)

#Compute AIC and BIC
AIC(birdmod1)
[1] 168.1984
BIC(birdmod1)
[1] 189.1314
#Hosmer-Lemeshow test
#Requires the glmtoolbox package
library(glmtoolbox)
hltest(birdmod1)

   The Hosmer-Lemeshow goodness-of-fit test

 Group Size Observed  Expected
     1   15        0 0.4319818
     2   15        2 1.3729248
     3   15        4 2.2192168
     4   15        3 3.3585724
     5   15        4 4.1808653
     6   15        3 5.0285541
     7   15        6 6.6485045
     8   15        8 8.3305082
     9   15       11 9.1584956
    10   12        8 8.2703765

         Statistic =  4.85068 
degrees of freedom =  8 
           p-value =  0.77341 
#Metrics of association
#Requires the blorr package
library(blorr)
blr_pairs(birdmod1)

We evaluate the logistic regression model with AIC, BIC, the Hosmer-Lemeshow goodness-of-fit test, and metrics of association. Interpreting these metrics individually is not practically useful; they are more typically used to compare two or more models to determine which is best.
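For example, here is a sketch comparing the full model to a reduced model (the choice to drop Gender and Status is hypothetical, for illustration only); we would prefer the model with the smaller AIC/BIC and comparable or better fit:

# Hypothetical reduced model for comparison: drops Gender and Status
birdmod2<-glm(LungCancer~Age+Smoked+Cigarettes+Bird,data=birdkeeping,family=binomial())
AIC(birdmod1); AIC(birdmod2)
BIC(birdmod1); BIC(birdmod2)
hltest(birdmod2)
blr_pairs(birdmod2)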

Step 5: Evaluate the model

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

First we perform the model adequacy test (Likelihood Ratio Test):

#REPRINT MODEL
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
Gender1      0.56127    0.53116   1.057 0.290653    
Status1      0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
Bird1        1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k, the number of predictors (here 6)
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
  • Hypotheses:
    • \(H_0: \beta_1=...=\beta_6=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 ,...,\beta_6\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: chi-square with 6 DF
  • Test Statistic: \(\chi^2=32.93675\)
  • Pvalue: <0.0001
  • Decision: p-value < 0.0001 < 0.05 -> REJECT H0
  • Conclusion: The model with these predictors is adequate at predicting the probability of having lung cancer.

Then we test “the most important predictors”:

Test the Significance of Bird (one predictor- Wald Test)

  • Hypotheses:
    • \(H_0: \beta_6=0\)
    • \(H_a:\beta_6 \neq 0\)
  • Distribution of test statistic: Chi Square with 1 DF OR Z
  • Test Statistic: \(\chi^2=3.313^2\approx 10.98\) OR \(Z=3.313\)
  • Pvalue: 0.000923
  • Decision: 0.000923<0.05 -> REJECT H0
  • Conclusion: Owning a bird contributes to predicting the probability of having lung cancer, given the other variables are present in the model.
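An equivalent check on a single predictor (or a subset of predictors) is a partial likelihood ratio test, comparing the full model to a reduced model fit without that predictor. A sketch for Bird (the reduced-model name here is just for illustration):

# Partial likelihood ratio test for Bird: fit the model without Bird, then compare
birdmod_nobird<-glm(LungCancer~Gender+Status+Age+Smoked+Cigarettes,data=birdkeeping,family=binomial())
anova(birdmod_nobird,birdmod1,test="Chisq")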

Step 6: Check the Residuals

This step will not be evaluated in this course.

blr_plot_difchisq_leverage(birdmod1)

blr_plot_diag_influence(birdmod1)

blr_residual_diagnostics(birdmod1)
plot(birdmod1,which=5)

We are not concerned with overall trends in these plots when identifying outliers; we use them to identify observations that are “far” from the others (potential outliers or influential points).
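As a rough base-R alternative to the plots above, Cook's distance can flag observations that stand out; the 4/n cutoff below is only a common rule of thumb:

# Flag observations with relatively large Cook's distance (rule of thumb: > 4/n)
cd<-cooks.distance(birdmod1)
which(cd>4/nrow(birdkeeping))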

Step 7: Use the model for prediction

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

#compute pi-hat values: the predicted probability of success for each observation
birdkeeping$fits<-fitted.values(birdmod1)
head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod1)
Confusion Matrix and Statistics 

          Reference
Prediction  0  1
         0 80 22
         1 18 27

                Accuracy : 0.7279 
     No Information Rate : 0.6667 

                   Kappa : 0.3750 

McNemar's Test P-Value  : 0.6353 

             Sensitivity : 0.5510 
             Specificity : 0.8163 
          Pos Pred Value : 0.6000 
          Neg Pred Value : 0.7843 
              Prevalence : 0.3333 
          Detection Rate : 0.1837 
    Detection Prevalence : 0.3061 
       Balanced Accuracy : 0.6837 
               Precision : 0.6000 
                  Recall : 0.5510 

        'Positive' Class : 1

We can compare how well our model does at predicting within the data set. For example, observation 1 was actually a “success”, but had a predicted probability of 0.41.

Final note: We can classify predictions as “correct” or “incorrect.” The default cutoff is typically 0.5, meaning if \(\hat{\pi}\) is greater than 0.5 the observation is labeled a “success.” You can change this threshold depending on the context.
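For example, a sketch of re-classifying the observations with a 0.3 cutoff, using the fitted probabilities stored in birdkeeping$fits above (the 0.3 value is arbitrary, chosen only for illustration):

# Reclassify using a 0.3 cutoff instead of the default 0.5
predclass<-ifelse(birdkeeping$fits>0.3,1,0)
table(Predicted=predclass,Actual=birdkeeping$LungCancer)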