140.622.04_Logistic Regression: Nested Models, Fit

1. Explore data by using boxplots table

The following boxplots show the distribution of the breastfeeding status in the age of child, the age of mother, and parity.

According to the boxplots, children received breast feeding have lower age than those who do not receive breast feeding. We can see the large difference in age distribution between two genders. The 1st qualtile of the age of non-breast feeding children is higher than the 3rd qualtile of that of breast feeding children. The age of mothers who provide breastfeeding to their child is younger than that of mothers who do not. The mean of parity is similar regardless of the breastfeeding status. The prevalence of breastfeeding is 40.87% in boys and 41.82% in girls.

# import dataset and check the missing value
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

nepalibf = read_csv("nepalibf.csv")

## Rows: 500 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): obs, bf, parity, sex_chld, age_chld, age_mom, age_chld_quart, age_c...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nepalibf)

## # A tibble: 6 × 8
##     obs    bf parity sex_chld age_chld age_mom age_chld_quart age_chldmed_quart
##   <dbl> <dbl>  <dbl>    <dbl>    <dbl>   <dbl>          <dbl>             <dbl>
## 1    66     1      1        1       14      17              0                13
## 2     9     1      1        0        3      18              0                13
## 3    67     1      1        1       14      18              0                13
## 4   105     1      1        0       19      18              0                13
## 5    10     1      1        1        3      19              0                13
## 6    42     1      1        0       11      19              0                13

summary(nepalibf)

##       obs              bf             parity          sex_chld    
##  Min.   :  1.0   Min.   :0.0000   Min.   : 1.000   Min.   :0.000  
##  1st Qu.:125.8   1st Qu.:0.0000   1st Qu.: 2.750   1st Qu.:0.000  
##  Median :250.5   Median :0.0000   Median : 4.000   Median :0.000  
##  Mean   :250.5   Mean   :0.4131   Mean   : 4.446   Mean   :0.464  
##  3rd Qu.:375.2   3rd Qu.:1.0000   3rd Qu.: 6.000   3rd Qu.:1.000  
##  Max.   :500.0   Max.   :1.0000   Max.   :14.000   Max.   :1.000  
##                  NA's   :28                                       
##     age_chld        age_mom      age_chld_quart  age_chldmed_quart
##  Min.   : 0.00   Min.   :17.00   Min.   : 0.00   Min.   :13.00    
##  1st Qu.:21.00   1st Qu.:24.00   1st Qu.:21.00   1st Qu.:28.00    
##  Median :36.00   Median :28.00   Median :36.00   Median :44.00    
##  Mean   :37.34   Mean   :28.65   Mean   :28.37   Mean   :37.53    
##  3rd Qu.:53.00   3rd Qu.:34.00   3rd Qu.:53.00   3rd Qu.:62.00    
##  Max.   :76.00   Max.   :52.00   Max.   :53.00   Max.   :62.00    
##

table(nepalibf$bf)

## 
##   0   1 
## 277 195

# Compare boxplots for the continuous covariates (X’s) vs breastfeeding status
boxplot(age_chld ~ bf, data=nepalibf, ylab="Age of child (months)")

boxplot(age_mom ~ bf, data=nepalibf)

boxplot(parity ~ bf, data=nepalibf)

# Tabulate the proportion breastfed by gender:
CT = xtabs(~ sex_chld + bf, data=nepalibf)
addmargins(CT)

##         bf
## sex_chld   0   1 Sum
##      0   149 103 252
##      1   128  92 220
##      Sum 277 195 472

prop.table(CT, margin=1)

##         bf
## sex_chld         0         1
##        0 0.5912698 0.4087302
##        1 0.5818182 0.4181818

addmargins(prop.table(CT, margin=1), margin=2)

##         bf
## sex_chld         0         1       Sum
##        0 0.5912698 0.4087302 1.0000000
##        1 0.5818182 0.4181818 1.0000000

2. Estimate the prevalence of breast feeding as a function of child’s age (centered at the mean age) and gender using logistic regression

Definition of the variables

Variable ‘sex_chld’ is a child’s gender: 0=male, 1=female,
Variable ‘agechldc’ is a age of child in month,
The logistic regression model is as following: Log(p/(1-p)) = - 0.9197 - 0.3111sex_chld – 0.1868agechldc

Interpretation of the coefficients

β0 (- 0.9197) : the log odds of breast feeding for new born boys. e^ β0 is the odds of breast feeding among new born boys.
β1 (- 0.3111) : the log odds ratio(=differnce in the log odds) of breast feeding comparing new born girls to new born boys. e^ β1 is the odds ratio of breast feeding comparing new born girls to new born boys.
β2 (- 0.1868) : the log odds ratio(=differnce in the log odds) of breast feeding for each month in age, for boys. e^ β1 is the odds ratio of breast feeding for each month in age, among boys.

# centered at the mean age
nepalibf = nepalibf %>%
  mutate(agechldc = age_chld - mean(age_chld))

# make a logistic regression 
model1 = glm(bf ~ sex_chld + agechldc, data=nepalibf,family=binomial(link="logit"))

summary(model1)

## 
## Call:
## glm(formula = bf ~ sex_chld + agechldc, family = binomial(link = "logit"), 
##     data = nepalibf)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9197     0.2251  -4.085  4.4e-05 ***
## sex_chld     -0.3111     0.3159  -0.985    0.325    
## agechldc     -0.1868     0.0176 -10.613  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 640.01  on 471  degrees of freedom
## Residual deviance: 264.04  on 469  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 270.04
## 
## Number of Fisher Scoring iterations: 6

exp(model1$coefficients)

## (Intercept)    sex_chld    agechldc 
##   0.3986168   0.7326530   0.8296270

exp(confint.default(model1))

##                 2.5 %    97.5 %
## (Intercept) 0.2563965 0.6197251
## sex_chld    0.3944773 1.3607386
## agechldc    0.8014982 0.8587430

3. Test the hypothesis that the association of breast feeding prevalence and child’s age is different for boys and girls using interaction model.

Based on the Z-test result for the interaction coefficient of agechldc:sex_chld, we fail to reject the null hypothesis that the interaction coefficient is equal to zero with the p-value, 0.314. It can be the weak evidence of a different age-prevalence relationship for boys and girls, indicating that sex is not a effect measure modifier of the association between breastfeeding and gender.

Interpretation of the coefficients

β0 (- 0.8546) : the log odds of breast feeding for new born boys. e^ β0 is the odds of breast feeding among new born boys.
β1 (- 0.1718) : the log odds ratio(=differnce in the log odds) of breast feeding for each 1 month in age for boys. e^ β1 is the odds ratio of breast feeding for each 1 month in age among boys.
β2 (- 0.5215) : the log odds ratio(=differnce in the log odds) of breast feeding in infant between girls and boys. e^ β2 is the odds ratio of breast feeding in infant between girls and boys.
β3 (- 0.0372) : the difference in log odds ratio of beast feeding for 1 month in age between girls and boys. e^ β3 is exponential of the log odds ratio of breast feeding in infant between girls and boys.

# use interaction model
modelE = glm(bf~agechldc+sex_chld+sex_chld:agechldc, data=nepalibf, family=binomial(link="logit"))
summary(modelE)

## 
## Call:
## glm(formula = bf ~ agechldc + sex_chld + sex_chld:agechldc, family = binomial(link = "logit"), 
##     data = nepalibf)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.85461    0.22449  -3.807 0.000141 ***
## agechldc          -0.17177    0.02169  -7.921 2.36e-15 ***
## sex_chld          -0.52147    0.38500  -1.354 0.175584    
## agechldc:sex_chld -0.03725    0.03699  -1.007 0.313921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 640.01  on 471  degrees of freedom
## Residual deviance: 262.99  on 468  degrees of freedom
##   (28 observations deleted due to missingness)
## AIC: 270.99
## 
## Number of Fisher Scoring iterations: 7

4. Display a graph of the estimated breast feeding prevalence -vs- child’s age with separate curves for boys and girls.

Both gender have the highest breastfeeding prevalence, 100%, at their birth. However, males have the higher breastfeeding prevalence than females in most of follow-up period from birth to 5 years old. The breastfeeding prevalence of males is 0-0.25% higher than that of females.

# separate data for boys and girls
nepalibf = nepalibf %>%
  na.omit() %>%  # Remove observations with missing data
  mutate(sex_chld=recode_factor(sex_chld, `0`="Male", `1`="Female")) # Factor sex_chld

# make a logistic regression model
modelD = glm(bf ~ sex_chld + age_chld, data=nepalibf, family=binomial(link="logit"))
nepalibf = nepalibf %>% mutate(phat = predict(modelD, type="response"))

# draw a graph of the estimated breast feeding prevalence -vs- child’s age with separate curves for boys and girls.
qplot(x=age_chld, y=phat, color=sex_chld, shape=sex_chld, data=nepalibf,
      xlab="Child's age in months", ylab="Predicted prevalence of Breast-feeding")

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

5. Hosmer-Lemeshow goodness-of-fit test

The result of the Hosmer-Lemeshow test shows the evidence of a good fit of model. With the large p-value, 0.6405, we fail to reject the null hypothesis that model accrately characterizes the data distribution.

We conducted a cross-sectional study in order to figure out the potential risk factors for the breastfeeding. The observed prevalence of breastfeeding is 40.87% in boys and 41.82% in girls. The younger the age of child and mother, and the less parity decrease the breastfeeding prevalence [Figure 1]. The age of child is statistically significant associated with the breastfeeding, adjusting for gender (Odds Ratio(OR) = 0.83, 95% CI (0.80 - 0.85)). However, there is no siginificant association between child;s gender and the breastfeeding status (OR = 0.73, 95% CI (0.39 – 1.36)). Furthermore, the gender is not a potential effect measure modifier (p value for the interaction term = 0.314). We can see the drastic drop of breastfeeding prevalence in children aged more than 20 months and less than 12% of children receive breastfeeding after their age 40 months [Figure 2]. According to the Hosmer-Lemeshow test, our multiple logistic regression model has a good fit with the observed values (p-value = 0.6405).

#utils::install.packages("ResourceSelection")
library(ResourceSelection)

## ResourceSelection 0.3-6   2023-06-27

#Hosmer-Lemeshow goodness-of- fit test 
hoslem.test(nepalibf$bf, nepalibf$phat, g=10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  nepalibf$bf, nepalibf$phat
## X-squared = 6.0598, df = 8, p-value = 0.6405

hoslem.test(nepalibf$bf, nepalibf$phat)$observed

##                     
## cutyhat              y0 y1
##   [0.000291,0.00245] 48  0
##   (0.00245,0.0121]   48  0
##   (0.0121,0.0301]    45  2
##   (0.0301,0.103]     42  5
##   (0.103,0.273]      39  9
##   (0.273,0.535]      32 14
##   (0.535,0.768]      15 32
##   (0.768,0.915]       7 40
##   (0.915,0.971]       1 46
##   (0.971,0.998]       0 47

hoslem.test(nepalibf$bf, nepalibf$phat)$expected

##                     
## cutyhat                   yhat0      yhat1
##   [0.000291,0.00245] 47.9504680  0.0495320
##   (0.00245,0.0121]   47.6775128  0.3224872
##   (0.0121,0.0301]    46.0230415  0.9769585
##   (0.0301,0.103]     43.9456280  3.0543720
##   (0.103,0.273]      38.3778845  9.6221155
##   (0.273,0.535]      27.5374715 18.4625285
##   (0.535,0.768]      16.3517203 30.6482797
##   (0.768,0.915]       6.4522289 40.5477711
##   (0.915,0.971]       2.1299058 44.8700942
##   (0.971,0.998]       0.5541386 46.4458614