Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer. Read more about this study here:

Gender: F, M: 1 represents females
Status: High, Low: 1 represents high economic status
Age: Owner’s age
Smoked: number of years the person has smoked
Cigarettes: number of cigarettes smoked per day
Bird: OwnBird, NoBird: 1 represents own a bird
Lung Cancer: 1 represents having lung cancer (response)

Step 1: Collect the Data

Data Compilation: These data were collected from researchers in a reputable journal.
Data Organization: Are the data organized for logistic regression?- each row contains information about one observation (experimental unit) including all of the explanatory predictors and the response in the columns.
Suitability of response variable for LOGISTIC regression: We want to look at the breakdown of successes and failures. It is useful if there is a relatively even number of successes and failures in the data as that will allow for more reliable estimates of the coefficients. (Which lead to more accurate predictions.) We do not have formally strict guidelines for this, but it is recommended to have about 10-20 observations per predictor in your model for the lower prevalence group. The bigger discrepancy in ratio of successes to failures, the larger the sample size is needed. We call this the Event Per Variable criterion (EPV).

Defining the response

\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.

The qualitative variables were originally stored as dummy variables (1 and 0). We are going to rename them as factors and qualitative levels in the data table so that they are easier to interpret in our analysis and output.

# table(response) for counts 
head(birdkeeping)

table(birdkeeping$LungCancer)


 No Yes 
 98  49

barplot(table(birdkeeping$LungCancer),xlab="Lung Cancer", main="Counts of Response")

Although this break down is not even, there is a suitable number of observations in each group. We have 49 observations in the lower case group, so we would likely not want to include more than about 5 predictors in our model.

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

Quantitative Variables

We explore the relationships with quantitative variables with box plots and means then classify the relationships as existent or not.

Age: Owner’s age
Smoked: number of years the person has smoked
Cigarettes: number of cigarettes smoked per day

# Age
boxplot(Age~LungCancer, data=birdkeeping)

tapply(birdkeeping$Age,birdkeeping$LungCancer,summary)

$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     37      52      59      57      63      67 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   37.0    52.0    58.0    56.9    63.0    66.0

# Smoked
boxplot(Smoked~LungCancer, data=birdkeeping)

tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)

$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   24.96   38.75   47.00 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   27.00   36.00   33.63   41.00   50.00

# Cigarettes
boxplot(Cigarettes~LungCancer, data=birdkeeping)

tapply(birdkeeping$Cigarettes,birdkeeping$LungCancer,summary)

$No
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    6.00   15.00   14.21   20.00   45.00 

$Yes
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   15.00   20.00   18.82   20.00   40.00

We do not see an association with the age of a person smoked and having lung cancer. For this sample, the average age is about 57 years old for both those who had and did not have lung cancer.
We see a slight association with the number of years a person smoked and having lung cancer. On average, those who have lung cancer smoked longer.
We see a slight association with the number of cigarettes a person smoked per day and having lung cancer. On average, those who have lung cancer smoked more cigarettes per day (20, compared to 15)

Qualitative Variables

We will explore the relationship with qualitative variables with mosaic plots and conditional proportions and classify each relationship as existent or not.

Gender: M, F
Status: High, Low
Bird: NoBird, OwnBird

# Gender
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Gender,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Gender,birdkeeping$LungCancer)

#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Gender,birdkeeping$LungCancer),margin=2)

   
          No      Yes
  M 0.755102 0.755102
  F 0.244898 0.244898

# Status
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Status,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Status,birdkeeping$LungCancer)

      
       No Yes
  Low  65  37
  High 33  12

#Convert contingency table to proportions, margin=2 specifies totaling over the columns
prop.table(table(birdkeeping$Status,birdkeeping$LungCancer),margin=2)

      
              No       Yes
  Low  0.6632653 0.7551020
  High 0.3367347 0.2448980

# Bird
#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Bird,birdkeeping$LungCancer)

         
          No Yes
  NoBird  64  16
  OwnBird 34  33

#Convert contingency table to proportions, margin=2 specifices totalling over the columns
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)

         
                 No       Yes
  NoBird  0.6530612 0.3265306
  OwnBird 0.3469388 0.6734694

We do not see an association between lung cancer and gender. The proportion of males (or the proportion of females) is exactly the same for both those who have and not have lung cancer.
We do not see an association between lung cancer and status. The proportion of high (or the proportion of low) is about the same for both those who have and not have lung cancer.
We see there might be an association with whether a person owned a bird and whether they have lung cancer. We that for those without lung cancer, 35 percent owned a bird. Compared to the 67 percent of those who had lung cancer owning a bird.
We will not explore interactions in this example.
We can also evaluate multicollinearity between the predictors.
We can also use variable screening in logistic regression.

Step 3: Estimate the model parameters

Here we will fit the model with all of the predictors.

NOTE: It is good practice to define what you want to be treated as the base level, or “failure” Y=0. You can do this with the relevel function. The relevel function can be used even if the original column is numeric.

# Be sure to clearly define the base level (failure group) for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"No")

# We can use the . to abbreviate the syntax and fit a model with all
# predictors in the data table.
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)


Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
GenderF      0.56127    0.53116   1.057 0.290653    
StatusHigh   0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
BirdOwnBird  1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

MODEL:

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 GenderF+\beta_2 StatusHigh+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6BirdOwnBird\)

PREDICTION EQUATION:

\(\hat{\pi^*}=-1.937+0.56 GenderF+0.105 StatusHigh-0.039 Age+0.073 Smoked+0.026Cigarettes+1.36BirdOwnBird\)

Where:

GenderF= 1 if female, 0 otherwise
StatusHigh= 1 if high economic status, 0 otherwise
BirdOwnBird= 1 if own a bird, 0 otherwise

Math Interlude

Show that the beta coefficients do not change with whatever form of the model you write.

Consider the model \(\pi^*=ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1x\)

Solve the equation for \(\pi\) to show \(\pi=\dfrac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\). Show all of your steps. Use the appropriate markdown syntax (it may help to to write it out on paper before writing into the markdown file.)

Show a one unit change in x results in a \(e^\beta\) factor increase in the odds for quantitative variables.

First, consider the model for the log-odds: \(\pi^*=\beta_0+\beta_1x\). Compute \(\pi^*\) for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the log-odds change for a one unit change in x.

Now, consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1x}\). Compute the odds for x=0, x=1, and x=2 in terms of the betas. Then determine how much do the odds change for a one unit change in x.

Show the interpretation for a dummy variable coefficient.

Assume we have a qualitative variable with 3 levels (A, B, C). Let DummyA = 1 if group A and 0 otherwise and DummyB=1 if group B, 0 otherwise. Group C is the base level.

Consider the model for the odds:\(\dfrac{\pi}{1-\pi}=e^{\beta_0+\beta_1DummyA +\beta_2DummyB}\). Compute the odds group A, B, and C in terms of the betas. Then determine the interpretation of a single beta coefficient associated with a dummy variable.

INTERPRETATIONS

# Extract just coefficients (or intervals)
coefficients(birdmod1)

(Intercept)     GenderF  StatusHigh         Age      Smoked  Cigarettes 
-1.93735861  0.56127270  0.10544761 -0.03975542  0.07286848  0.02601689 
BirdOwnBird 
 1.36259456

confint(birdmod1, level=.95)

Waiting for profiling to be done...

                  2.5 %     97.5 %
(Intercept) -5.59554204 1.54015158
GenderF     -0.48194383 1.61625027
StatusHigh  -0.82695901 1.02585600
Age         -0.11263170 0.02827386
Smoked       0.02529552 0.13082279
Cigarettes  -0.02407110 0.07680737
BirdOwnBird  0.57309984 2.19296212

#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))

(Intercept)     GenderF  StatusHigh         Age      Smoked  Cigarettes 
  0.1440840   1.7529020   1.1112079   0.9610245   1.0755891   1.0263583 
BirdOwnBird 
  3.9063153

exp(confint(birdmod1,level=.95))

Waiting for profiling to be done...

                  2.5 %   97.5 %
(Intercept) 0.003714385 4.665297
GenderF     0.617581748 5.034178
StatusHigh  0.437377326 2.789482
Age         0.893479671 1.028677
Smoked      1.025618164 1.139766
Cigarettes  0.976216301 1.079834
BirdOwnBird 1.773756897 8.961720

Quantitative Interpretations:

\(\hat{\beta_3}\): For a one unit year in age, the estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer decreases by 0.034, given the other variables are held constant.
\(e^{\hat{\beta_3}}\): For a one unit year in age, the estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer decreases by a factor of 0.96, given the other variables are held constant.
Confidence interval for \(e^{\beta_3}\): We are 95% confident that for a one unit year in age, the odds of a person having lung cancer changes by a factor between 0.893 and 1.02, given the other variables are held constant. (Since 1 is in this interval the variable “age” is not a significant predictor.)

On Your Own: Try the interpretations for the “Smoked” variable.

Qualitative Interpretations:

\(\hat{\beta_6}\): The estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer is 1.36 higher for those who own a bird, given the other variables are held constant.
\(e^{\hat{\beta_6}}\): The estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer is 3.9 times higher for those who own a bird, given the other variables are held constant.
Confidence interval for \(e^{\beta_3}\): We are 95% confident that the odds of a person having lung cancer increases by a factor between 1.77 and 8.96 given the other variables are held constant. (Since 1 is not in this interval “bird” is a significant predictor)

On Your Own: Try the interpretations for the “Status” variable.

Notes on Intervals and Significance

For logistic regression we use the hypothesis format of \(H_0: \beta_i=0\). That translates to \(H_0: e^{\beta_i}=1\). Therefore, if and only if the interval of the odds does not contain 1, the interval for the log-odds will not contain 0, and the parameter is significant.

Step 4: Evaluate the model

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

First we Perform the model adequacy test (Likelihood Ratio Test):

#REPRINT MODEL
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)


Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
GenderF      0.56127    0.53116   1.057 0.290653    
StatusHigh   0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
BirdOwnBird  1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest

[1] 32.93675

#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue

[1] 1.078379e-05

Hypotheses:
- \(H_0: \beta_1=...=\beta_6=0\) (the model is not adequate)
- \(H_a\):at least one of \(\beta_1 ,...,\beta_6\) does not equal 0 (the model is adequate)
Distribution of test statistic: chi-square with 6 DF
Test Statistic: \(\chi^2=187.14-154.20=32.93675\)
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with these predictors is adequate at predicting the probability of having lung cancer.

Then we test “the most important predictors”:

Test the Significance of Bird (one predictor- Wald Test)

Hypotheses:
- \(H_0: \beta_6=0\)
- \(H_a:\beta_6 \neq 0\)
Distribution of test statistic: Chi Square with 1 DF OR Z
Test Statistic: \(\chi^2=3.31^2\) OR Z=3.31
Pvalue: 0.000923
Decision: 0.000923<0.05 -> REJECT H0
Conclusion: Owning a bird contributes to predicting the probability of having lung cancer, given the other variables are present in the model. (Additionally, since 1 was not in the interval for the odds “bird” is a significant predictor.)

Test the Significance of Age (one predictor- Wald Test)

Hypotheses:
- \(H_0: \beta_3=0\)
- \(H_a:\beta_3 \neq 0\)
Distribution of test statistic: Chi Square with 1 DF OR Z
Test Statistic: \(\chi^2=|-1.2544|^2\) OR Z=-1.120
Pvalue: 0.262503
Decision: 0.262503<0.05 -> FAIL TO REJECT H0
Conclusion: Age does not contribute to predicting the probability of having lung cancer, given the other variables are present in the model. (Additionally, since 1 was in the interval for the odds “age” is not a significant predictor.) We will remove it and refit the model.

birdmod2<-glm(LungCancer~.-Age,data=birdkeeping,family=binomial())
summary(birdmod2)


Call:
glm(formula = LungCancer ~ . - Age, family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.80143    0.82243  -4.622  3.8e-06 ***
GenderF      0.58361    0.52725   1.107 0.268340    
StatusHigh   0.03958    0.46451   0.085 0.932102    
Smoked       0.05677    0.02033   2.793 0.005229 ** 
Cigarettes   0.03107    0.02476   1.255 0.209555    
BirdOwnBird  1.43560    0.40655   3.531 0.000414 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 155.49  on 141  degrees of freedom
AIC: 167.49

Number of Fisher Scoring iterations: 5

Test the Significance of smoking related variables (two predictors- drop in deviance test)

Complete model: \(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 GenderF+\beta_2 StatusHigh+\beta_3Smoked +\beta_4Cigarettes+\beta_5Bird\)
Hypotheses:
- \(H_0: \beta_3=\beta_4=0\)
- \(H_a:\beta_3=\beta_4 \neq 0\)
- reduced model: \(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 GenderF+\beta_2 StatusHigh+\beta_5Bird\)

#Fit the reduced model
birdmodred<-glm(LungCancer~.-Age-Smoked-Cigarettes,data=birdkeeping,family=binomial())
summary(birdmodred)


Call:
glm(formula = LungCancer ~ . - Age - Smoked - Cigarettes, family = binomial(), 
    data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.2412     0.3299  -3.762 0.000169 ***
GenderF      -0.3409     0.4412  -0.773 0.439743    
StatusHigh   -0.2419     0.4252  -0.569 0.569476    
BirdOwnBird   1.3734     0.3836   3.580 0.000343 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 172.11  on 143  degrees of freedom
AIC: 180.11

Number of Fisher Scoring iterations: 4

#Perform drop in deviance test
#anova(reduced,complete,test="Chisq")
anova(birdmodred, birdmod2, test = "Chisq")

Distribution of test statistic: Chi Square with 2 DF
Test Statistic: \(\chi^2=16.62\)
Pvalue: 0.000246
Decision: 0.000246<0.05 -> REJECT H0
Conclusion: The smoking related variables (number of years smoked, and number of cigarettes smoked per day) contributes to predicting the probability of having lung cancer, given the other variables are present in the model.

That will conclude our variable testing for this example.

Step 5: Assess the model (Unit 4.2)

#Compute AIC and BIC
AIC(birdmod2)

[1] 167.4945

BIC(birdmod2)

[1] 185.4371

#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod2)


   The Hosmer-Lemeshow goodness-of-fit test

 Group Size Observed  Expected
     1   15        0 0.6212571
     2   15        3 1.5580862
     3   15        2 2.2771044
     4   15        2 3.2741449
     5   15        6 4.1286036
     6   15        5 4.9967048
     7   15        5 6.2345576
     8   15        8 8.2309190
     9   15        9 9.2164032
    10   12        9 8.4622194

         Statistic =  4.54343 
degrees of freedom =  8 
           p-value =  0.80507

We will evaluate the logistic regression model with AIC, BIC, and Hosmer-Lemeshow Goodness of Fit Test. Individually interpreting these metrics are not practically useful. But these are more typically used to compare between two or more models to determine which is best.

AIC = 167.49
BIC = 185.44
Hosmer-Lemeshow Test:
- \(H_0\): The logistic regression model provides an adequate fit to the data
- With a p-value of 0.805 , we fail to reject the null hypothesis. There is not enough evidence to determine the model does not provide a good fit. We can move forward in the analysis.

Step 6: Check Residuals and Assumptions

We will not cover this specifically in this course, but the supplemental material gives you a bit of a background.

Step 7: Use the model for prediction