Holst, Kromhout, and Brand conducted a case-control study to determine whether keeping birds increases the risk of lung cancer. The idea is that people who keep birds may inhale more allergens and dust particles and thus may be more likely to develop lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer.

Step 1: Collect the Data

Check that the response variable is appropriate for LOGISTIC regression: we want to look at the breakdown of successes and failures. A relatively even split between successes and failures allows for more accurate estimates of the coefficients (which in turn lead to more accurate predictions).

Defining the response

\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.

birdkeeping<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/C7%20BirdKeeping.csv",header=T)
head(birdkeeping)
names(birdkeeping)
[1] "Gender"     "Status"     "Age"        "Smoked"     "Cigarettes"
[6] "Bird"       "LungCancer"
# table(response) for counts 
table(birdkeeping$LungCancer)

 0  1 
98 49 

Although this breakdown is not even, there is a suitable number of observations in each group.
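A quick way to see the split as proportions rather than counts (here about one-third successes) is a sketch with prop.table():

# Proportion of failures (0) and successes (1) in the response
prop.table(table(birdkeeping$LungCancer))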

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We will explore the relationship between the response and each qualitative explanatory variable with mosaic plots and conditional proportions, and classify each relationship as existent or not. For each quantitative explanatory variable we explore side-by-side box plots and group summary statistics, then classify the relationship as existent or not.

#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)

#contingency table for counts (table(rows, columns))
table(birdkeeping$Bird,birdkeeping$LungCancer)
   
     0  1
  0 64 16
  1 34 33
#Convert contingency table to proportions; margin=2 specifies proportions within each column (columns sum to 1)
prop.table(table(birdkeeping$Bird,birdkeeping$LungCancer),margin=2)
   
            0         1
  0 0.6530612 0.3265306
  1 0.3469388 0.6734694
#Side-by-side boxplots and summary stats for quantitative variables
boxplot(Smoked~LungCancer, data=birdkeeping)

tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   24.96   38.75   47.00 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   27.00   36.00   33.63   41.00   50.00 
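The same approach extends to the remaining predictors. For example, a sketch of the same checks for Gender (qualitative) and Age (quantitative):

# Qualitative: mosaic plot and conditional proportions for Gender
mosaicplot(LungCancer~Gender,data=birdkeeping)
prop.table(table(birdkeeping$Gender,birdkeeping$LungCancer),margin=2)

# Quantitative: side-by-side boxplots and summary stats for Age
boxplot(Age~LungCancer, data=birdkeeping)
tapply(birdkeeping$Age,birdkeeping$LungCancer,summary)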

Step 3: Estimate the model parameters

Here we will fit the model with all of the predictors.

NOTE: It is good practice to define which level you want treated as the base level, or “failure” (Y=0). You can do this with the relevel function, which works even if the original column is a character string.
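For instance, a small hypothetical sketch with a response stored as strings rather than 0/1 (this column is not in the bird-keeping data; it is made up for illustration):

# Hypothetical character response (not a column in this data set)
ychar<-c("Case","Control","Control","Case","Control")
# By default the first level alphabetically ("Case") would be the base;
# relevel() makes "Control" the base ("failure") level instead
y<-relevel(factor(ychar),ref="Control")
levels(y)  # "Control" is now first, so glm() would model P(Y="Case")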

# Be sure to clearly define the base level for response
birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"0")

# We want to ensure the explanatory variables are treated as factors
# (with levels), instead of numeric 0 and 1
# We only do this if a variable meant to be qualitative is read in as numeric
birdkeeping$Gender<-as.factor(birdkeeping$Gender)
birdkeeping$Status<-as.factor(birdkeeping$Status)
birdkeeping$Bird<-as.factor(birdkeeping$Bird)

# you will know a variable is treated as a factor if it outputs as such in the
# model with the level names (or you can check the type in the environment)
# we will specify family=binomial()
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
Gender1      0.56127    0.53116   1.057 0.290653    
Status1      0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
Bird1        1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

MODEL:

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

PREDICTION EQUATION:

\(\hat{\pi^*}=-1.937+0.561 Gender+0.105 Status-0.040 Age+0.073 Smoked+0.026 Cigarettes+1.363 Bird\)

Where:

  • Gender= 1 if female, 0 otherwise
  • Status= 1 if high economic status, 0 otherwise
  • Bird= 1 if own a bird, 0 otherwise
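To turn a predicted log-odds into a probability we apply the inverse logit, \(\hat{\pi}=\frac{e^{\hat{\pi}^*}}{1+e^{\hat{\pi}^*}}\). A sketch for a hypothetical individual (the predictor values below are made up for illustration):

# Hypothetical individual: female, high status, age 50, smoked 20 years,
# 15 cigarettes per day, owns a bird
newperson<-data.frame(Gender=factor(1,levels=c(0,1)),Status=factor(1,levels=c(0,1)),
                      Age=50,Smoked=20,Cigarettes=15,Bird=factor(1,levels=c(0,1)))
predict(birdmod1,newdata=newperson,type="link")      # estimated log-odds (pi-star-hat)
predict(birdmod1,newdata=newperson,type="response")  # estimated probability (pi-hat)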

INTERPRETATIONS

# Extract just coefficients (or intervals)
coefficients(birdmod1)
(Intercept)     Gender1     Status1         Age      Smoked  Cigarettes 
-1.93735861  0.56127270  0.10544761 -0.03975542  0.07286848  0.02601689 
      Bird1 
 1.36259456 
confint(birdmod1)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept) -5.59554204 1.54015158
Gender1     -0.48194383 1.61625027
Status1     -0.82695901 1.02585600
Age         -0.11263170 0.02827386
Smoked       0.02529552 0.13082279
Cigarettes  -0.02407110 0.07680737
Bird1        0.57309984 2.19296212
#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))
(Intercept)     Gender1     Status1         Age      Smoked  Cigarettes 
  0.1440840   1.7529020   1.1112079   0.9610245   1.0755891   1.0263583 
      Bird1 
  3.9063153 
exp(confint(birdmod1))
Waiting for profiling to be done...
                  2.5 %   97.5 %
(Intercept) 0.003714385 4.665297
Gender1     0.617581748 5.034178
Status1     0.437377326 2.789482
Age         0.893479671 1.028677
Smoked      1.025618164 1.139766
Cigarettes  0.976216301 1.079834
Bird1       1.773756897 8.961720

Quantitative Interpretations:

  • For a one-year increase in age, the estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer decreases by 0.040, given the other variables are held constant.
  • For a one-year increase in age, the estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer change by a factor of 0.96 (about a 4% decrease), given the other variables are held constant.
  • We are 95% confident that for a one-year increase in age, the odds of a person having lung cancer change by a factor between 0.893 and 1.029, given the other variables are held constant. (Since 1 is in this interval, the variable “Age” is not a significant predictor.)
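The same idea handles multi-unit changes: multiply the coefficient (or its interval endpoints) by the change before exponentiating. A sketch for a 10-year age difference:

# Estimated odds ratio (and 95% CI) for a 10-year increase in age
exp(10*coefficients(birdmod1)["Age"])
exp(10*confint(birdmod1)["Age",])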

Qualitative Interpretations:

  • The estimated log-odds (\(\hat{\pi^*}\)) of a person having lung cancer is 1.363 higher for those who own a bird than for those who do not, given the other variables are held constant.
  • The estimated odds (\(\widehat{\frac{\pi}{1-\pi}}\)) of a person having lung cancer are 3.9 times as high for those who own a bird as for those who do not, given the other variables are held constant.
  • We are 95% confident that the odds of a person having lung cancer are higher for bird owners by a factor between 1.77 and 8.96, given the other variables are held constant. (Since 1 is not in this interval, “Bird” is a significant predictor.)

Step 4: Assess the model (Unit 4.2)

#Compute AIC and BIC
AIC(birdmod1)
[1] 168.1984
BIC(birdmod1)
[1] 189.1314
#Hosmer-Lemeshow test
#Requires the glmtoolbox package
library(glmtoolbox)
hltest(birdmod1)

   The Hosmer-Lemeshow goodness-of-fit test

 Group Size Observed  Expected
     1   15        0 0.4319818
     2   15        2 1.3729248
     3   15        4 2.2192168
     4   15        3 3.3585724
     5   15        4 4.1808653
     6   15        3 5.0285541
     7   15        6 6.6485045
     8   15        8 8.3305082
     9   15       11 9.1584956
    10   12        8 8.2703765

         Statistic =  4.85068 
degrees of freedom =  8 
           p-value =  0.77341 
#Metrics of association
#Requires the blorr package
library(blorr)
blr_pairs(birdmod1)

We evaluate the logistic regression model with AIC, BIC, the Hosmer-Lemeshow goodness-of-fit test, and metrics of association. Interpreting these metrics individually is not practically useful; they are more typically used to compare two or more models to determine which is best.
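For example, here is a sketch comparing the full model to a reduced model (the choice to drop Gender and Status is hypothetical, for illustration only); we would prefer the model with the smaller AIC/BIC and comparable or better fit:

# Hypothetical reduced model for comparison: drops Gender and Status
birdmod2<-glm(LungCancer~Age+Smoked+Cigarettes+Bird,data=birdkeeping,family=binomial())
AIC(birdmod1); AIC(birdmod2)
BIC(birdmod1); BIC(birdmod2)
hltest(birdmod2)
blr_pairs(birdmod2)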

Step 5: Evaluate the model

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

First we perform the model adequacy test (Likelihood Ratio Test):

#REPRINT MODEL
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)

Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
Gender1      0.56127    0.53116   1.057 0.290653    
Status1      0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
Bird1        1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5
#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest
[1] 32.93675
#compute the pvalue
#the second argument is the degrees of freedom = k, the number of predictors (here 6)
llpvalue<-1-pchisq(lltest,6)
llpvalue
[1] 1.078379e-05
  • Hypotheses:
    • \(H_0: \beta_1=...=\beta_6=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 ,...,\beta_6\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: chi-square with 6 DF
  • Test Statistic: \(\chi^2=32.93675\)
  • Pvalue: <0.0001
  • Decision: p-value < 0.0001 < 0.05 -> REJECT H0
  • Conclusion: The model with these predictors is adequate at predicting the probability of having lung cancer.

Then we test “the most important predictors”:

Test the Significance of Bird (one predictor- Wald Test)

  • Hypotheses:
    • \(H_0: \beta_6=0\)
    • \(H_a:\beta_6 \neq 0\)
  • Distribution of test statistic: Chi Square with 1 DF OR Z
  • Test Statistic: \(\chi^2=3.313^2\approx 10.98\) OR \(Z=3.313\)
  • Pvalue: 0.000923
  • Decision: 0.000923<0.05 -> REJECT H0
  • Conclusion: Owning a bird contributes to predicting the probability of having lung cancer, given the other variables are present in the model.
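An equivalent check on a single predictor (or a subset of predictors) is a partial likelihood ratio test, comparing the full model to a reduced model fit without that predictor. A sketch for Bird (the reduced-model name here is just for illustration):

# Partial likelihood ratio test for Bird: fit the model without Bird, then compare
birdmod_nobird<-glm(LungCancer~Gender+Status+Age+Smoked+Cigarettes,data=birdkeeping,family=binomial())
anova(birdmod_nobird,birdmod1,test="Chisq")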

Step 6: Check the Residuals

This step will not be evaluated in this course.

blr_plot_difchisq_leverage(birdmod1)

blr_plot_diag_influence(birdmod1)

blr_residual_diagnostics(birdmod1)
plot(birdmod1,which=5)

We are not concerned with overall trends in these plots when identifying outliers; we use them to identify observations that are “far” from the others (potential outliers or influential points).
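As a rough base-R alternative to the plots above, Cook's distance can flag observations that stand out; the 4/n cutoff below is only a common rule of thumb:

# Flag observations with relatively large Cook's distance (rule of thumb: > 4/n)
cd<-cooks.distance(birdmod1)
which(cd>4/nrow(birdkeeping))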

Step 7: Use the model for prediction

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

#compute pi-hat values: the predicted probability of success for each observation
birdkeeping$fits<-fitted.values(birdmod1)
head(birdkeeping)
#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod1)
Confusion Matrix and Statistics 

          Reference
Prediction  0  1
         0 80 22
         1 18 27

                Accuracy : 0.7279 
     No Information Rate : 0.6667 

                   Kappa : 0.3750 

McNemar's Test P-Value  : 0.6353 

             Sensitivity : 0.5510 
             Specificity : 0.8163 
          Pos Pred Value : 0.6000 
          Neg Pred Value : 0.7843 
              Prevalence : 0.3333 
          Detection Rate : 0.1837 
    Detection Prevalence : 0.3061 
       Balanced Accuracy : 0.6837 
               Precision : 0.6000 
                  Recall : 0.5510 

        'Positive' Class : 1

We can compare how well our model does at predicting within the data set. For example, observation 1 was actually a “success”, but had a predicted probability of 0.41.

Final note: We can classify predictions as “correct” or “incorrect.” The default cutoff is typically 0.5, meaning if \(\hat{\pi}\) is greater than 0.5 the observation is labeled a “success.” You can change this threshold depending on the context.
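For example, a sketch of re-classifying the observations with a 0.3 cutoff, using the fitted probabilities stored in birdkeeping$fits above (the 0.3 value is arbitrary, chosen only for illustration):

# Reclassify using a 0.3 cutoff instead of the default 0.5
predclass<-ifelse(birdkeeping$fits>0.3,1,0)
table(Predicted=predclass,Actual=birdkeeping$LungCancer)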