Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer. Read more about this study here:

Step 1: Collect the Data

Defining the response

\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.

The qualitative variables were originally stored as dummy variables (1 and 0). We are going to rename them as factors and qualitative levels in the data table so that they are easier to interpret in our analysis and output.

# table(response) for counts 
head(birdkeeping)
table(birdkeeping$LungCancer)

 No Yes 
 98  49 
barplot(table(birdkeeping$LungCancer),xlab="Lung Cancer", main="Counts of Response")

Although this break down is not even, there is a suitable number of observations in each group. We have 49 observations in the lower case group, so we would likely not want to include more than about 5 predictors in our model.

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

Quantitative Variables

We explore the relationships with quantitative variables with box plots and means then classify the relationships as existent or not.

  • Age: Owner’s age
  • Smoked: number of years the person has smoked
  • Cigarettes: number of cigarettes smoked per day
# Age
boxplot(Age~LungCancer, data=birdkeeping)

tapply(birdkeeping$Age,birdkeeping$LungCancer,summary)
$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     37      52      59      57      63      67 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   37.0    52.0    58.0    56.9    63.0    66.0 
# Smoked
boxplot(Smoked~LungCancer, data=birdkeeping)

tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   24.96   38.75   47.00 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   27.00   36.00   33.63   41.00   50.00 
# Cigarettes
boxplot(Cigarettes~LungCancer, data=birdkeeping)

tapply(birdkeeping$Cigarettes,birdkeeping$LungCancer,summary)
$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    6.00   15.00   14.21   20.00   45.00 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   15.00   20.00   18.82   20.00   40.00 
  • We do not see an association with the age of a person smoked and having lung cancer. For this sample, the average age is about 57 years old for both those who had and did not have lung cancer.
  • We see a slight association with the number of years a person smoked and having lung cancer. On average, those who have lung cancer smoked longer.
  • We see a slight association with the number of cigarettes a person smoked per day and having lung cancer. On average, those who have lung cancer smoked more cigarettes per day (20, compared to 15)

Qualitative Variables

We will explore the relationship with qualitative variables with mosaic plots and conditional proportions and classify each relationship as existent or not.

  • Gender: M, F
  • Status: High, Low
  • Bird: NoBird, OwnBird
# Gender
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Gender,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Gender,birdkeeping$LungCancer)
   
    No Yes
  M 74  37
  F 24  12
#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Gender,birdkeeping$LungCancer),margin=2)
   
          No      Yes
  M 0.755102 0.755102
  F 0.244898 0.244898
# Status
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Status,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Status,birdkeeping$LungCancer)
      
       No Yes
  Low  65  37
  High 33  12
#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Status,birdkeeping$LungCancer),margin=2)
      
              No       Yes
  Low  0.6632653 0.7551020
  High 0.3367347 0.2448980
# Bird
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Bird,birdkeeping$LungCancer)
         
          No Yes
  NoBird  64  16
  OwnBird 34  33
#Convert contingency table to proportions, margin=2 specifices totalling over the columns
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)
         
                 No       Yes
  NoBird  0.6530612 0.3265306
  OwnBird 0.3469388 0.6734694
  • We do not see an association between lung cancer and gender. The proportion of males (or the proportion of females) is exactly the same for both those who have and not have lung cancer.

  • We do not see an association between lung cancer and status. The proportion of high (or the proportion of low) is about the same for both those who have and not have lung cancer.

  • We see there might be an association with whether a person owned a bird and whether they have lung cancer. We that for those without lung cancer, 35 percent owned a bird. Compared to the 67 percent of those who had lung cancer owning a bird.

  • We will not explore interactions in this example.

  • We can also evaluate multicollinearity between the predictors.

  • We can also use variable screening in logistic regression.

Step 3: Estimate the model parameters

Here we will fit the model with all of the predictors.

NOTE: It is good practice to define what you want to be treated as the base level, or “failure” Y=0. You can do this with the relevel function. The relevel function can be used even if the original column is numeric.

# Be sure to clearly define the base level (failure group) for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"No")

# We can use the . to abbreviate the syntax and fit a model with all
# predictors in the data table.
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
GenderF      0.56127    0.53116   1.057 0.290653    
StatusHigh   0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
BirdOwnBird  1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

MODEL:

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 GenderF+\beta_2 StatusHigh+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6BirdOwnBird\)

PREDICTION EQUATION:

\(\hat{\pi^*}=-1.937+0.56 GenderF+0.105 StatusHigh-0.039 Age+0.073 Smoked+0.026Cigarettes+1.36BirdOwnBird\)

Where:

  • GenderF= 1 if female, 0 otherwise
  • StatusHigh= 1 if high economic status, 0 otherwise
  • BirdOwnBird= 1 if own a bird, 0 otherwise

Math Interlude

Show that the beta coefficients do not change with whatever form of the model you write.

Consider the model \(\pi^*=ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1x\)

Solve the equation for \(\pi\) to show \(\pi=\dfrac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\). Show all of your steps. Use the appropriate markdown syntax (it may help to to write it out on paper before writing into the markdown file.)

Show a one unit change in x results in a \(e^\beta\) factor increase in the odds for quantitative variables.

First, consider the model for the log-odds: \(\pi^*=\beta_0+\beta_1x\). Compute \(\pi^*\) for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the log-odds change for a one unit change in x.

Now, consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1x}\). Compute the odds for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the odds change for a one unit change in x.

Show the interpretation for a dummy variable coefficient.

Assume we have a qualitative variable with 3 levels (A, B, C). Let DummyA = 1 if group A and 0 otherwise and DummyB=1 if group B, 0 otherwise. Group C is the base level.

Consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1DummyA +\beta_2DummyB}\). Compute the odds group A, B, and C in terms of the betas. Then determine the interpretation of a single beta coefficient associated with a dummy variable.

INTERPRETATIONS

# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept)     GenderF  StatusHigh         Age      Smoked  Cigarettes 
-1.93735861  0.56127270  0.10544761 -0.03975542  0.07286848  0.02601689 
BirdOwnBird 
 1.36259456 
confint(birdmod1, level=.95)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept) -5.59554204 1.54015158
GenderF     -0.48194383 1.61625027
StatusHigh  -0.82695901 1.02585600
Age         -0.11263170 0.02827386
Smoked       0.02529552 0.13082279
Cigarettes  -0.02407110 0.07680737
BirdOwnBird  0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept)     GenderF  StatusHigh         Age      Smoked  Cigarettes 
  0.1440840   1.7529020   1.1112079   0.9610245   1.0755891   1.0263583 
BirdOwnBird 
  3.9063153 
exp(confint(birdmod1,level=.95))
Waiting for profiling to be done...
                  2.5 %   97.5 %
(Intercept) 0.003714385 4.665297
GenderF     0.617581748 5.034178
StatusHigh  0.437377326 2.789482
Age         0.893479671 1.028677
Smoked      1.025618164 1.139766
Cigarettes  0.976216301 1.079834
BirdOwnBird 1.773756897 8.961720

Quantitative Interpretations:

  • \(\hat{\beta_3}\): For a one unit year in age, the estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer decreases by 0.034, given the other variables are held constant.
  • \(e^{\hat{\beta_3}}\): For a one unit year in age, the estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer decreases by a factor of 0.96, given the other variables are held constant.
  • Confidence interval for \(e^{\beta_3}\): We are 95% confident that for a one unit year in age, the odds of a person having lung cancer changes by a factor between 0.893 and 1.02, given the other variables are held constant. (Since 1 is in this interval the variable “age” is not a significant predictor.)

On Your Own: Try the interpretations for the “Smoked” variable.

Qualitative Interpretations:

  • \(\hat{\beta_6}\): The estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer is 1.36 higher for those who own a bird, given the other variables are held constant.
  • \(e^{\hat{\beta_6}}\): The estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer is 3.9 times higher for those who own a bird, given the other variables are held constant.
  • Confidence interval for \(e^{\beta_3}\): We are 95% confident that the odds of a person having lung cancer increases by a factor between 1.77 and 8.96 given the other variables are held constant. (Since 1 is not in this interval “bird” is a significant predictor)

On Your Own: Try the interpretations for the “Status” variable.

Notes on Intervals and Significance

For logistic regression we use the hypothesis format of \(H_0: \beta_i=0\). That translates to \(H_0: e^{\beta_i}=1\). Therefore, if and only if the interval of the odds does not contain 1, the interval for the log-odds will not contain 0, and the parameter is significant.

Step 4: Evaluate the model

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

First we Perform the model adequacy test (Likelihood Ratio Test):

#REPRINT MODEL
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
GenderF      0.56127    0.53116   1.057 0.290653    
StatusHigh   0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
BirdOwnBird  1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
  • Hypotheses:
    • \(H_0: \beta_1=...=\beta_6=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 ,...,\beta_6\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: chi-square with 6 DF
  • Test Statistic: \(\chi^2=187.14-154.20=32.93675\)
  • Pvalue: <0.0001
  • Decision: 0.0001<0.05 -> REJECT H0
  • Conclusion: The model with these predictors is adequate at predicting the probability of having lung cancer.

Then we test “the most important predictors”:

Test the Significance of Bird (one predictor- Wald Test)

  • Hypotheses:
    • \(H_0: \beta_6=0\)
    • \(H_a:\beta_6 \neq 0\)
  • Distribution of test statistic: Chi Square with 1 DF OR Z
  • Test Statistic: \(\chi^2=3.31^2\) OR Z=3.31
  • Pvalue: 0.000923
  • Decision: 0.000923<0.05 -> REJECT H0
  • Conclusion: Owning a bird contributes to predicting the probability of having lung cancer, given the other variables are present in the model. (Additionally, since 1 was not in the interval for the odds “bird” is a significant predictor.)

Test the Significance of Age (one predictor- Wald Test)

  • Hypotheses:
    • \(H_0: \beta_3=0\)
    • \(H_a:\beta_3 \neq 0\)
  • Distribution of test statistic: Chi Square with 1 DF OR Z
  • Test Statistic: \(\chi^2=|-1.2544|^2\) OR Z=-1.120
  • Pvalue: 0.262503
  • Decision: 0.262503<0.05 -> FAIL TO REJECT H0
  • Conclusion: Age does not contribute to predicting the probability of having lung cancer, given the other variables are present in the model. (Additionally, since 1 was in the interval for the odds “age” is not a significant predictor.) We will remove it and refit the model.
birdmod2<-glm(LungCancer~.-Age,data=birdkeeping,family=binomial())
summary(birdmod2)

Call:
glm(formula = LungCancer ~ . - Age, family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.80143    0.82243  -4.622  3.8e-06 ***
GenderF      0.58361    0.52725   1.107 0.268340    
StatusHigh   0.03958    0.46451   0.085 0.932102    
Smoked       0.05677    0.02033   2.793 0.005229 ** 
Cigarettes   0.03107    0.02476   1.255 0.209555    
BirdOwnBird  1.43560    0.40655   3.531 0.000414 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 155.49  on 141  degrees of freedom
AIC: 167.49

Number of Fisher Scoring iterations: 5

Step 5: Assess the model (Unit 4.2)

#Compute AIC and BIC
AIC(birdmod2)
[1] 167.4945
BIC(birdmod2)
[1] 185.4371
#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod2)

   The Hosmer-Lemeshow goodness-of-fit test

 Group Size Observed  Expected
     1   15        0 0.6212571
     2   15        3 1.5580862
     3   15        2 2.2771044
     4   15        2 3.2741449
     5   15        6 4.1286036
     6   15        5 4.9967048
     7   15        5 6.2345576
     8   15        8 8.2309190
     9   15        9 9.2164032
    10   12        9 8.4622194

         Statistic =  4.54343 
degrees of freedom =  8 
           p-value =  0.80507 

We will evaluate the logistic regression model with AIC, BIC, and Hosmer-Lemeshow Goodness of Fit Test. Individually interpreting these metrics are not practically useful. But these are more typically used to compare between two or more models to determine which is best.

Step 6: Check Residuals and Assumptions

We will not cover this specifically in this course, but the supplemental material gives you a bit of a background.

Step 7: Use the model for prediction

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

#compute pi-hat values. These are probabilities of success for each value.
birdkeeping$fits<-fitted.values(birdmod2)

head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod2)
Confusion Matrix and Statistics 

          Reference
Prediction No Yes
         0 81  23
         1 17  26

                Accuracy : 0.7279 
     No Information Rate : 0.6667 

                   Kappa : 0.3684 

McNemars's Test P-Value  : 0.4292 

             Sensitivity : 0.5306 
             Specificity : 0.8265 
          Pos Pred Value : 0.6047 
          Neg Pred Value : 0.7788 
              Prevalence : 0.3333 
          Detection Rate : 0.1769 
    Detection Prevalence : 0.2925 
       Balanced Accuracy : 0.6786 
               Precision : 0.6047 
                  Recall : 0.5306 

        'Positive' Class : 1

Final note: We can classify “correct” and “incorrect” conclusions. The default cutoff is typically 0.5. Meaning is \(\hat{\pi}\) if greater than 0.5 it will be labeled as a “success”. You can change this threshold depending on the context.