Unit 4.1-4.2: Logistic Regression

Holst, Kromhout, and Brand conducted a case-control study to determine if keeping birds increased the chances of getting lung cancer. It is believed that people who keep birds may inhale more allergens and dust particles and thus may be more likely to get lung cancer. They collected data on 49 people with lung cancer and 98 people with similar demographics who did not have lung cancer.

Gender: 1 represents females
Status: 1 represents high economic status
Age: Owner’s age
Smoked: number of years the person has smoked
Cigarettes: number of cigarettes smoked per day
Bird: 1 represents own a bird
Lung Cancer: 1 represents having lung cancer

Step 1: Collect the Data

Check the appropriateness of response variable for LOGISTIC regression: We want to look at the breakdown of successes and failures. It is useful if there is a relatively even number of successes and failures in the data as that will allow for more accurate estimates of the coefficients. (Which lead to more accurate predictions)

Defining the response

\(\pi=P(Y=1)=\) The probability of having lung cancer given the set of predictors.

birdkeeping<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/C7%20BirdKeeping.csv",header=T)
head(birdkeeping)

names(birdkeeping)

[1] "Gender"     "Status"     "Age"        "Smoked"     "Cigarettes"
[6] "Bird"       "LungCancer"

table(birdkeeping$LungCancer)


 0  1 
98 49

Although this break down is not even, there is a suitable number of observations in each group

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We will explore the relationship with qualitative variables with mosaic plots and proportion and classify each relationship as existent or not. We explore the box plots and means for each quantitative variable explanatory variable then classify the relationships as existent or not.

#Mosaic Plots and proportions for qualitative variables
mosaicplot(LungCancer~Bird,data=birdkeeping)

prop.table(table(birdkeeping$LungCancer,birdkeeping$Bird),margin=2)

   
            0         1
  0 0.8000000 0.5074627
  1 0.2000000 0.4925373

#Side by side boxplots and summary stats for quantitative
boxplot(Smoked~LungCancer, data=birdkeeping)

tapply(birdkeeping$Smoked,birdkeeping$LungCancer,summary)

$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   16.00   25.00   24.96   38.75   47.00 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   27.00   36.00   33.63   41.00   50.00

We see there might be an association with whether a person owned a bird and whether they have lung cancer. We that for those without lung cancer, 20 percent owned a bird. Compared to the 50 percent of those who had lung cancer owning a bird.
We see a slight association with the number of years a person smoked and having lung cancer. On average, those who have lung cancer smoked longer.
We will not explore interactions in this example.
We can also evaluate multicollinearity between the predictors.
We can also use variable screening in logistic regression.

Step 3: Estimate the model parameters

Here we will fit the model with all of the predictors.

NOTE: It is good practice to define what you want to be treated as the base level Y=0. You can do this with the relevel function. The relevel function can be used even if the original column is a string.

birdkeeping$LungCancer<-relevel(factor(birdkeeping$LungCancer),"0")
birdmod1<-glm(LungCancer~.,data=birdkeeping,family=binomial())
summary(birdmod1)


Call:
glm(formula = LungCancer ~ ., family = binomial(), data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.93736    1.80425  -1.074 0.282924    
Gender       0.56127    0.53116   1.057 0.290653    
Status       0.10545    0.46885   0.225 0.822050    
Age         -0.03976    0.03548  -1.120 0.262503    
Smoked       0.07287    0.02649   2.751 0.005940 ** 
Cigarettes   0.02602    0.02552   1.019 0.308055    
Bird         1.36259    0.41128   3.313 0.000923 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 154.20  on 140  degrees of freedom
AIC: 168.2

Number of Fisher Scoring iterations: 5

MODEL:

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

PREDICTION EQUATION:

\(\hat{\pi^*}=-1.937+0.56 Gender+0.105 Status-0.039 Age+0.073 Smoked+0.026Cigarettes+1.36Bird\)

INTERPRETATIONS

# Extract just coefficients (or intervals)
coefficients(birdmod1)

(Intercept)      Gender      Status         Age      Smoked  Cigarettes 
-1.93735861  0.56127270  0.10544761 -0.03975542  0.07286848  0.02601689 
       Bird 
 1.36259456

confint(birdmod1)

Waiting for profiling to be done...

                  2.5 %     97.5 %
(Intercept) -5.59554204 1.54015158
Gender      -0.48194383 1.61625027
Status      -0.82695901 1.02585600
Age         -0.11263170 0.02827386
Smoked       0.02529552 0.13082279
Cigarettes  -0.02407110 0.07680737
Bird         0.57309984 2.19296212

#Exponentiation of betas (or intervals)
exp(coefficients(birdmod1))

(Intercept)      Gender      Status         Age      Smoked  Cigarettes 
  0.1440840   1.7529020   1.1112079   0.9610245   1.0755891   1.0263583 
       Bird 
  3.9063153

exp(confint(birdmod1))

Waiting for profiling to be done...

                  2.5 %   97.5 %
(Intercept) 0.003714385 4.665297
Gender      0.617581748 5.034178
Status      0.437377326 2.789482
Age         0.893479671 1.028677
Smoked      1.025618164 1.139766
Cigarettes  0.976216301 1.079834
Bird        1.773756897 8.961720

Quantitative Interpretations:

For a one unit year in age, the log-odds of a person having lung cancer decreases by 0.034, given the other variables are held constant.
For a one unit year in age, the odds of a person having lung cancer decreases by a factor of 0.96, given the other variables are held constant.
We are 95% confident that for a one unit year in age, the odds of a person having lung cancer changes by a factor between 0.893 and 1.02, given the other variables are held constant. (Since 1 is in this interval the variable “age” is not a significant predictor.)

Qualitative Interpretations:

The log-odds of a person having lung cancer is 1.36 higher for those who own a bird, given the other variables are held constant.
The odds of a person having lung cancer is 3.9 times higher for those who own a bird, given the other variables are held constant.
We are 95% confident that the odds of a person having lung cancer increases by a factor between 1.77 and 8.96 given the other variables are held constant. (Since 1 is not in this interval “bird” is a significant predictor)

Step 4: Assess the model

#Compute AIC and BIC
AIC(birdmod1)

[1] 168.1984

BIC(birdmod1)

[1] 189.1314

#Hosmer Lemeshow test
#Requires glmtoolbox package
hltest(birdmod1)


   The Hosmer-Lemeshow goodness-of-fit test

 Group Size Observed  Expected
     1   15        0 0.4319818
     2   15        2 1.3729248
     3   15        4 2.2192168
     4   15        3 3.3585724
     5   15        4 4.1808653
     6   15        3 5.0285541
     7   15        6 6.6485045
     8   15        8 8.3305082
     9   15       11 9.1584956
    10   12        8 8.2703765

         Statistic =  4.85068 
degrees of freedom =  8 
           p-value =  0.77341

#Metrics of Association
#requires blorr package
blr_pairs(birdmod1)

We will evaluate the logistic regression model with AIC, BIC, Hosmer-Lemeshow Goodness of Fit Test, and metrics of association. Individually these metrics are not useful. But these are more typically used to compare between two or more models to determine which is best.

AIC = 168.1984
BIC = 189.1314
Hosmer-Lemeshow Test:
- \(H_0\): The logistic regression model provides an adequate fit to the data
- With a p-value of 0.77, we fail to reject the null hypothesis. There is not enough evidence to determine the model does not provide a good fit. We can move forward in the analysis.
Our metrics of association are relatively low, which would not be good, but we do not have another model to compare to.

Step 5: Evaluate the model

\(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_4 Smoked+\beta_5Cigarettes+\beta_6Bird\)

First we Perform the model adequacy test (Likelihood Ratio Test):

#Compute the test statistic
lltest<-birdmod1$null.deviance-birdmod1$deviance
lltest

[1] 32.93675

#compute the pvalue
#the second argument is the degrees of freedom = k
llpvalue<-1-pchisq(lltest,6)
llpvalue

[1] 1.078379e-05

Hypotheses:
- \(H_0: \beta_1=...=\beta_6=0\) (the model is not adequate)
- \(H_a\):at least one of \(\beta_1 ,...,\beta_6\) does not equal 0 (the model is adequate)
Distribution of test statistic: chi-square with 6 DF
Test Statistic: \(\chi^2=32.93675\)
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with these predictors is adequate at predicting the probability of having lung cancer.

Then we test “the most important predictors”:

Test the Significance of Bird (one predictor- Wald Test)

Hypotheses:
- \(H_0: \beta_6=0\)
- \(H_a:\beta_6 \neq 0\)
Distribution of test statistic: Chi Square with 1 DF OR Z
Test Statistic: \(\chi^2=3.31^2\) OR Z=3.31
Pvalue: 0.000923
Decision: 0.000923<0.05 -> REJECT H0
Conclusion: Owning a bird contributes to predicting the probability of having lung cancer, given the other variables are present in the model.

Test the Significance of smoking related variables (two predictors- drop in deviance test)

Hypotheses:
- \(H_0: \beta_4=\beta_5=0\)
- \(H_a:\beta_4=\beta_5 \neq 0\)
- reduced model: \(\pi^*=\ln(\frac{\pi}{1-\pi})=\beta_0+\beta_1 Gender+\beta_2 Status+\beta_3 Age+\beta_6Bird\)

#Fit the reduced model
birdmod2<-glm(LungCancer~.-Smoked-Cigarettes,data=birdkeeping,family=binomial())
summary(birdmod2)


Call:
glm(formula = LungCancer ~ . - Smoked - Cigarettes, family = binomial(), 
    data = birdkeeping)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.01412    1.59950  -1.259 0.207951    
Gender      -0.28884    0.45248  -0.638 0.523245    
Status      -0.24848    0.42613  -0.583 0.559822    
Age          0.01313    0.02653   0.495 0.620557    
Bird         1.40133    0.38831   3.609 0.000308 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 187.14  on 146  degrees of freedom
Residual deviance: 171.87  on 142  degrees of freedom
AIC: 181.87

Number of Fisher Scoring iterations: 4

#Perform drop in deviance test
anova(birdmod2, birdmod1, test = "Chisq")

Distribution of test statistic: Chi Square with 2 DF
Test Statistic: \(\chi^2=17.669\)
Pvalue: 0.000145
Decision: 0.000145<0.05 -> REJECT H0
Conclusion: The smoking related variables (number of years smoked, and number of cigarettes smoked per day) contributes to predicting the probability of having lung cancer, given the other variables are present in the model.

That will conclude our variable testing for this example.

Step 6: Check the Residuals

blr_plot_difchisq_leverage(birdmod1)

blr_plot_diag_influence(birdmod1)

blr_residual_diagnostics(birdmod1)

plot(birdmod1,which=5)

We are not concerned about the trends in these plots for identifying any outliers.

Step 7: Use the model for prediction

#compute pi-hat values. These are probabilities of success for each value.
birdkeeping$fits<-fitted.values(birdmod1)
head(birdkeeping)

#summarize the "correctly" classified predictions
blr_confusion_matrix(birdmod1)

Confusion Matrix and Statistics 

          Reference
Prediction  0  1
         0 80 22
         1 18 27

                Accuracy : 0.7279 
     No Information Rate : 0.6667 

                   Kappa : 0.3750 

McNemars's Test P-Value  : 0.6353 

             Sensitivity : 0.5510 
             Specificity : 0.8163 
          Pos Pred Value : 0.6000 
          Neg Pred Value : 0.7843 
              Prevalence : 0.3333 
          Detection Rate : 0.1837 
    Detection Prevalence : 0.3061 
       Balanced Accuracy : 0.6837 
               Precision : 0.6000 
                  Recall : 0.5510 

        'Positive' Class : 1

We can compare how well our model does at predicting within the data set. For example, observation 1 was a “success”, but had a predicted probability of 0.41.

Final note: We can classify “correct” and “incorrect” conclusions. the default cutoff is typically 0.5. - sensitivity is the ability of a test to correctly identify a success (true positive rate) - specificity is the ability of the test to correctly identify a failure (true negative rate) - accuracy overall the percentage of correctly identified results - McNemars’s Test has a null that there is no difference between the proportion of false positives and false negatives (good thing). A pvalue greater than alpha our model doesn’t over correct in either direction.

STAT 3220 Unit 4.1-4.2