The following boxplots show the distribution of the breastfeeding status in the age of child, the age of mother, and parity.
According to the boxplots, children received breast feeding have lower age than those who do not receive breast feeding. We can see the large difference in age distribution between two genders. The 1st qualtile of the age of non-breast feeding children is higher than the 3rd qualtile of that of breast feeding children. The age of mothers who provide breastfeeding to their child is younger than that of mothers who do not. The mean of parity is similar regardless of the breastfeeding status. The prevalence of breastfeeding is 40.87% in boys and 41.82% in girls.
# import dataset and check the missing value
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nepalibf = read_csv("nepalibf.csv")
## Rows: 500 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): obs, bf, parity, sex_chld, age_chld, age_mom, age_chld_quart, age_c...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nepalibf)
## # A tibble: 6 × 8
## obs bf parity sex_chld age_chld age_mom age_chld_quart age_chldmed_quart
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 66 1 1 1 14 17 0 13
## 2 9 1 1 0 3 18 0 13
## 3 67 1 1 1 14 18 0 13
## 4 105 1 1 0 19 18 0 13
## 5 10 1 1 1 3 19 0 13
## 6 42 1 1 0 11 19 0 13
summary(nepalibf)
## obs bf parity sex_chld
## Min. : 1.0 Min. :0.0000 Min. : 1.000 Min. :0.000
## 1st Qu.:125.8 1st Qu.:0.0000 1st Qu.: 2.750 1st Qu.:0.000
## Median :250.5 Median :0.0000 Median : 4.000 Median :0.000
## Mean :250.5 Mean :0.4131 Mean : 4.446 Mean :0.464
## 3rd Qu.:375.2 3rd Qu.:1.0000 3rd Qu.: 6.000 3rd Qu.:1.000
## Max. :500.0 Max. :1.0000 Max. :14.000 Max. :1.000
## NA's :28
## age_chld age_mom age_chld_quart age_chldmed_quart
## Min. : 0.00 Min. :17.00 Min. : 0.00 Min. :13.00
## 1st Qu.:21.00 1st Qu.:24.00 1st Qu.:21.00 1st Qu.:28.00
## Median :36.00 Median :28.00 Median :36.00 Median :44.00
## Mean :37.34 Mean :28.65 Mean :28.37 Mean :37.53
## 3rd Qu.:53.00 3rd Qu.:34.00 3rd Qu.:53.00 3rd Qu.:62.00
## Max. :76.00 Max. :52.00 Max. :53.00 Max. :62.00
##
table(nepalibf$bf)
##
## 0 1
## 277 195
# Compare boxplots for the continuous covariates (X’s) vs breastfeeding status
boxplot(age_chld ~ bf, data=nepalibf, ylab="Age of child (months)")
boxplot(age_mom ~ bf, data=nepalibf)
boxplot(parity ~ bf, data=nepalibf)
# Tabulate the proportion breastfed by gender:
CT = xtabs(~ sex_chld + bf, data=nepalibf)
addmargins(CT)
## bf
## sex_chld 0 1 Sum
## 0 149 103 252
## 1 128 92 220
## Sum 277 195 472
prop.table(CT, margin=1)
## bf
## sex_chld 0 1
## 0 0.5912698 0.4087302
## 1 0.5818182 0.4181818
addmargins(prop.table(CT, margin=1), margin=2)
## bf
## sex_chld 0 1 Sum
## 0 0.5912698 0.4087302 1.0000000
## 1 0.5818182 0.4181818 1.0000000
Definition of the variables
Interpretation of the coefficients
# centered at the mean age
nepalibf = nepalibf %>%
mutate(agechldc = age_chld - mean(age_chld))
# make a logistic regression
model1 = glm(bf ~ sex_chld + agechldc, data=nepalibf,family=binomial(link="logit"))
summary(model1)
##
## Call:
## glm(formula = bf ~ sex_chld + agechldc, family = binomial(link = "logit"),
## data = nepalibf)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9197 0.2251 -4.085 4.4e-05 ***
## sex_chld -0.3111 0.3159 -0.985 0.325
## agechldc -0.1868 0.0176 -10.613 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 640.01 on 471 degrees of freedom
## Residual deviance: 264.04 on 469 degrees of freedom
## (28 observations deleted due to missingness)
## AIC: 270.04
##
## Number of Fisher Scoring iterations: 6
exp(model1$coefficients)
## (Intercept) sex_chld agechldc
## 0.3986168 0.7326530 0.8296270
exp(confint.default(model1))
## 2.5 % 97.5 %
## (Intercept) 0.2563965 0.6197251
## sex_chld 0.3944773 1.3607386
## agechldc 0.8014982 0.8587430
Based on the Z-test result for the interaction coefficient of agechldc:sex_chld, we fail to reject the null hypothesis that the interaction coefficient is equal to zero with the p-value, 0.314. It can be the weak evidence of a different age-prevalence relationship for boys and girls, indicating that sex is not a effect measure modifier of the association between breastfeeding and gender.
Interpretation of the coefficients
# use interaction model
modelE = glm(bf~agechldc+sex_chld+sex_chld:agechldc, data=nepalibf, family=binomial(link="logit"))
summary(modelE)
##
## Call:
## glm(formula = bf ~ agechldc + sex_chld + sex_chld:agechldc, family = binomial(link = "logit"),
## data = nepalibf)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.85461 0.22449 -3.807 0.000141 ***
## agechldc -0.17177 0.02169 -7.921 2.36e-15 ***
## sex_chld -0.52147 0.38500 -1.354 0.175584
## agechldc:sex_chld -0.03725 0.03699 -1.007 0.313921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 640.01 on 471 degrees of freedom
## Residual deviance: 262.99 on 468 degrees of freedom
## (28 observations deleted due to missingness)
## AIC: 270.99
##
## Number of Fisher Scoring iterations: 7
Both gender have the highest breastfeeding prevalence, 100%, at their birth. However, males have the higher breastfeeding prevalence than females in most of follow-up period from birth to 5 years old. The breastfeeding prevalence of males is 0-0.25% higher than that of females.
# separate data for boys and girls
nepalibf = nepalibf %>%
na.omit() %>% # Remove observations with missing data
mutate(sex_chld=recode_factor(sex_chld, `0`="Male", `1`="Female")) # Factor sex_chld
# make a logistic regression model
modelD = glm(bf ~ sex_chld + age_chld, data=nepalibf, family=binomial(link="logit"))
nepalibf = nepalibf %>% mutate(phat = predict(modelD, type="response"))
# draw a graph of the estimated breast feeding prevalence -vs- child’s age with separate curves for boys and girls.
qplot(x=age_chld, y=phat, color=sex_chld, shape=sex_chld, data=nepalibf,
xlab="Child's age in months", ylab="Predicted prevalence of Breast-feeding")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The result of the Hosmer-Lemeshow test shows the evidence of a good fit of model. With the large p-value, 0.6405, we fail to reject the null hypothesis that model accrately characterizes the data distribution.
We conducted a cross-sectional study in order to figure out the potential risk factors for the breastfeeding. The observed prevalence of breastfeeding is 40.87% in boys and 41.82% in girls. The younger the age of child and mother, and the less parity decrease the breastfeeding prevalence [Figure 1]. The age of child is statistically significant associated with the breastfeeding, adjusting for gender (Odds Ratio(OR) = 0.83, 95% CI (0.80 - 0.85)). However, there is no siginificant association between child;s gender and the breastfeeding status (OR = 0.73, 95% CI (0.39 – 1.36)). Furthermore, the gender is not a potential effect measure modifier (p value for the interaction term = 0.314). We can see the drastic drop of breastfeeding prevalence in children aged more than 20 months and less than 12% of children receive breastfeeding after their age 40 months [Figure 2]. According to the Hosmer-Lemeshow test, our multiple logistic regression model has a good fit with the observed values (p-value = 0.6405).
#utils::install.packages("ResourceSelection")
library(ResourceSelection)
## ResourceSelection 0.3-6 2023-06-27
#Hosmer-Lemeshow goodness-of- fit test
hoslem.test(nepalibf$bf, nepalibf$phat, g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: nepalibf$bf, nepalibf$phat
## X-squared = 6.0598, df = 8, p-value = 0.6405
hoslem.test(nepalibf$bf, nepalibf$phat)$observed
##
## cutyhat y0 y1
## [0.000291,0.00245] 48 0
## (0.00245,0.0121] 48 0
## (0.0121,0.0301] 45 2
## (0.0301,0.103] 42 5
## (0.103,0.273] 39 9
## (0.273,0.535] 32 14
## (0.535,0.768] 15 32
## (0.768,0.915] 7 40
## (0.915,0.971] 1 46
## (0.971,0.998] 0 47
hoslem.test(nepalibf$bf, nepalibf$phat)$expected
##
## cutyhat yhat0 yhat1
## [0.000291,0.00245] 47.9504680 0.0495320
## (0.00245,0.0121] 47.6775128 0.3224872
## (0.0121,0.0301] 46.0230415 0.9769585
## (0.0301,0.103] 43.9456280 3.0543720
## (0.103,0.273] 38.3778845 9.6221155
## (0.273,0.535] 27.5374715 18.4625285
## (0.535,0.768] 16.3517203 30.6482797
## (0.768,0.915] 6.4522289 40.5477711
## (0.915,0.971] 2.1299058 44.8700942
## (0.971,0.998] 0.5541386 46.4458614