Lab 09

The lab specifications can be found [here]

Labels

DBWT - birth weight (g)
CIG_REC - smoked during pregnancy
risks - risk factors reported
MAGER - mother’s age
MRACEHISP - mother’s race/ethnicity
DMAR - marital status
MEDUC - mother’s education
PRIORLIVE - prior births now living
PRIORDEAD - prior births now dead
BMI - mother’s body mass index, (kg/m2)

Exercise 1

Use the table1 package to create a table with appropriate descriptive information about the data set. Use the examples provided in the following vignette to help you build a table.

# Write your code here
table1(~  DBWT + risks + MAGER + MRACEHISP + DMAR + MEDUC + PRIORLIVE + PRIORLIVE +PRIORDEAD + BMI | CIG_REC, data=data,
    overall=F)

	No (N=1564)	Yes (N=124)
Birth Weight (g)
Mean (SD)	3250 (601)	3070 (597)
Median [Min, Max]	3290 [369, 5880]	3090 [630, 4220]
Risk Factors Reported
No	1088 (69.6%)	84 (67.7%)
Yes	476 (30.4%)	40 (32.3%)
Mother Age
Mean (SD)	29.0 (5.79)	27.8 (5.79)
Median [Min, Max]	29.0 [15.0, 49.0]	28.0 [17.0, 42.0]
Mother Race/Hispanic Origin
NH White	844 (54.0%)	94 (75.8%)
NH Black	266 (17.0%)	14 (11.3%)
NH Other	125 (8.0%)	10 (8.1%)
Hispanic	329 (21.0%)	6 (4.8%)
Marital Status
Married	1017 (65.0%)	40 (32.3%)
Unmarried	547 (35.0%)	84 (67.7%)
Mother Education
lower	177 (11.3%)	34 (27.4%)
HS	377 (24.1%)	45 (36.3%)
Some college	437 (27.9%)	37 (29.8%)
Uni	573 (36.6%)	8 (6.5%)
PRIORLIVE
0	601 (38.4%)	38 (30.6%)
1	516 (33.0%)	41 (33.1%)
2	265 (16.9%)	17 (13.7%)
3	93 (5.9%)	17 (13.7%)
4+	89 (5.7%)	11 (8.9%)
PRIORDEAD
0	1551 (99.2%)	123 (99.2%)
1+	13 (0.8%)	1 (0.8%)
Mother BMI
Mean (SD)	26.9 (6.32)	27.3 (6.42)
Median [Min, Max]	25.6 [15.6, 58.4]	26.3 [16.5, 50.1]

Exercise 2

Use the ggpairs function to explore the association between the different variables. Try to replicate the figure below, or create your own variation.

# Write your code here
GGally::ggpairs(data, columns = c(1,3,4,6,10),
  mapping = ggplot2::aes(color = CIG_REC),
  title = "Natality in the United states (2018)" ,
  upper = list(continuous = "density", combo = "box_no_facet"),
  legend = 1
  #lower = list(continuous = "density", combo = "dot_no_facet")
  )

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exercise 3

Fit four linear models, each nested within the other: the null model, the unadjusted model, the full model and the optimal model. Use the step function find the optimal model using the stepwise method. Then use anova to compare the four models. What is your conclusion? Now compare your models using modelsummary function. What does the number in the (brackets) mean? How can you identify the optimal model just by looking at the \(R^2\) and the AIC?

M1_null <- lm(DBWT ~ 1, data)
summary(M1_null)

## 
## Call:
## lm(formula = DBWT ~ 1, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2865.75  -314.75    35.25   394.25  2645.25 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3234.75      14.66   220.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 602.2 on 1687 degrees of freedom

M2_unadjusted <- lm(DBWT ~ CIG_REC, data)
equatiomatic::extract_eq(M2_unadjusted)

\[ \operatorname{DBWT} = \alpha + \beta_{1}(\operatorname{CIG\_REC}_{\operatorname{Yes}}) + \epsilon \]

summary(M2_unadjusted)

## 
## Call:
## lm(formula = DBWT ~ CIG_REC, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2878.9  -300.6    35.1   382.1  2632.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3247.90      15.19 213.867  < 2e-16 ***
## CIG_RECYes   -179.10      56.03  -3.196  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 600.6 on 1686 degrees of freedom
## Multiple R-squared:  0.006023,   Adjusted R-squared:  0.005434 
## F-statistic: 10.22 on 1 and 1686 DF,  p-value: 0.001418

M3_full <- lm(DBWT ~ ., data)
equatiomatic::extract_eq(M3_full)

\[ \operatorname{DBWT} = \alpha + \beta_{1}(\operatorname{CIG\_REC}_{\operatorname{Yes}}) + \beta_{2}(\operatorname{risks}_{\operatorname{Yes}}) + \beta_{3}(\operatorname{MAGER}) + \beta_{4}(\operatorname{MRACEHISP}_{\operatorname{NH\ Black}}) + \beta_{5}(\operatorname{MRACEHISP}_{\operatorname{NH\ Other}}) + \beta_{6}(\operatorname{MRACEHISP}_{\operatorname{Hispanic}}) + \beta_{7}(\operatorname{DMAR}_{\operatorname{Unmarried}}) + \beta_{8}(\operatorname{MEDUC}_{\operatorname{HS}}) + \beta_{9}(\operatorname{MEDUC}_{\operatorname{Some\ college}}) + \beta_{10}(\operatorname{MEDUC}_{\operatorname{Uni}}) + \beta_{11}(\operatorname{PRIORLIVE}_{\operatorname{1}}) + \beta_{12}(\operatorname{PRIORLIVE}_{\operatorname{2}}) + \beta_{13}(\operatorname{PRIORLIVE}_{\operatorname{3}}) + \beta_{14}(\operatorname{PRIORLIVE}_{\operatorname{4+}}) + \beta_{15}(\operatorname{PRIORDEAD}_{\operatorname{1+}}) + \beta_{16}(\operatorname{BMI}) + \epsilon \]

summary(M3_full)

## 
## Call:
## lm(formula = DBWT ~ ., data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2741.78  -301.50    24.63   359.05  2394.91 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3162.373    107.021  29.549  < 2e-16 ***
## CIG_RECYes        -156.593     56.325  -2.780  0.00549 ** 
## risksYes          -152.902     32.055  -4.770 2.00e-06 ***
## MAGER               -3.836      3.085  -1.243  0.21391    
## MRACEHISPNH Black -290.175     41.829  -6.937 5.70e-12 ***
## MRACEHISPNH Other -120.774     52.960  -2.280  0.02271 *  
## MRACEHISPHispanic  -72.464     39.120  -1.852  0.06415 .  
## DMARUnmarried     -189.782     35.084  -5.409 7.24e-08 ***
## MEDUCHS              8.448     49.909   0.169  0.86561    
## MEDUCSome college   -0.240     50.838  -0.005  0.99623    
## MEDUCUni            31.610     56.901   0.556  0.57861    
## PRIORLIVE1         140.483     34.725   4.046 5.46e-05 ***
## PRIORLIVE2         123.385     43.880   2.812  0.00498 ** 
## PRIORLIVE3         161.298     63.149   2.554  0.01073 *  
## PRIORLIVE4+        162.508     67.562   2.405  0.01627 *  
## PRIORDEAD1+        265.466    154.184   1.722  0.08530 .  
## BMI                 10.493      2.277   4.609 4.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 572.4 on 1671 degrees of freedom
## Multiple R-squared:  0.1053, Adjusted R-squared:  0.09673 
## F-statistic: 12.29 on 16 and 1671 DF,  p-value: < 2.2e-16

# Finding the optimal model
Optimodel <- step(M1_null,
     scope = list(upper = M3_full),
     direction="both",
     data=data)

## Start:  AIC=21609.55
## DBWT ~ 1
## 
##             Df Sum of Sq       RSS   AIC
## + DMAR       1  28672236 583166851 21530
## + MRACEHISP  3  28557357 583281730 21535
## + PRIORLIVE  4   7485192 604353895 21597
## + MEDUC      3   5189885 606649203 21601
## + CIG_REC    1   3685187 608153901 21601
## + risks      1   3277949 608561138 21602
## + BMI        1   3170730 608668358 21603
## + MAGER      1   2941408 608897679 21603
## + PRIORDEAD  1    849628 610989460 21609
## <none>                   611839087 21610
## 
## Step:  AIC=21530.53
## DBWT ~ DMAR
## 
##             Df Sum of Sq       RSS   AIC
## + MRACEHISP  3  15359037 567807814 21492
## + risks      1   4395289 578771562 21520
## + BMI        1   3607980 579558871 21522
## + PRIORLIVE  4   4500359 578666492 21526
## + CIG_REC    1    978350 582188501 21530
## + PRIORDEAD  1    818812 582348039 21530
## <none>                   583166851 21530
## + MAGER      1     82475 583084376 21532
## + MEDUC      3    313468 582853384 21536
## - DMAR       1  28672236 611839087 21610
## 
## Step:  AIC=21491.48
## DBWT ~ DMAR + MRACEHISP
## 
##             Df Sum of Sq       RSS   AIC
## + BMI        1   4532441 563275373 21480
## + risks      1   3924653 563883161 21482
## + CIG_REC    1   2181695 565626119 21487
## + PRIORLIVE  4   3880005 563927809 21488
## + PRIORDEAD  1    846700 566961114 21491
## <none>                   567807814 21492
## + MAGER      1     52434 567755381 21493
## + MEDUC      3     98014 567709800 21497
## - MRACEHISP  3  15359037 583166851 21530
## - DMAR       1  15473916 583281730 21535
## 
## Step:  AIC=21479.95
## DBWT ~ DMAR + MRACEHISP + BMI
## 
##             Df Sum of Sq       RSS   AIC
## + risks      1   6059709 557215664 21464
## + CIG_REC    1   2362075 560913299 21475
## + PRIORLIVE  4   3457999 559817375 21478
## + PRIORDEAD  1    896345 562379028 21479
## <none>                   563275373 21480
## + MAGER      1    125890 563149483 21482
## + MEDUC      3    115009 563160364 21486
## - BMI        1   4532441 567807814 21492
## - MRACEHISP  3  16283498 579558871 21522
## - DMAR       1  15325756 578601129 21523
## 
## Step:  AIC=21463.69
## DBWT ~ DMAR + MRACEHISP + BMI + risks
## 
##             Df Sum of Sq       RSS   AIC
## + PRIORLIVE  4   5580134 551635530 21455
## + CIG_REC    1   2199370 555016294 21459
## + PRIORDEAD  1   1025923 556189741 21463
## <none>                   557215664 21464
## + MAGER      1     22588 557193076 21466
## + MEDUC      3     69675 557145989 21470
## - risks      1   6059709 563275373 21480
## - BMI        1   6667497 563883161 21482
## - MRACEHISP  3  15921403 573137067 21505
## - DMAR       1  16431163 573646827 21511
## 
## Step:  AIC=21454.7
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE
## 
##             Df Sum of Sq       RSS   AIC
## + CIG_REC    1   2707991 548927539 21448
## + PRIORDEAD  1    921529 550714001 21454
## <none>                   551635530 21455
## + MAGER      1    280671 551354859 21456
## + MEDUC      3    173640 551461890 21460
## - PRIORLIVE  4   5580134 557215664 21464
## - BMI        1   6486820 558122350 21472
## - risks      1   8181845 559817375 21478
## - MRACEHISP  3  15145908 566781438 21494
## - DMAR       1  14456460 566091990 21496
## 
## Step:  AIC=21448.4
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC
## 
##             Df Sum of Sq       RSS   AIC
## + PRIORDEAD  1    922493 548005046 21448
## <none>                   548927539 21448
## + MAGER      1    337056 548590483 21449
## + MEDUC      3     75431 548852108 21454
## - CIG_REC    1   2707991 551635530 21455
## - PRIORLIVE  4   6088755 555016294 21459
## - BMI        1   6662694 555590232 21467
## - risks      1   8131760 557059299 21471
## - DMAR       1  10955889 559883427 21480
## - MRACEHISP  3  16576572 565504111 21493
## 
## Step:  AIC=21447.56
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC + 
##     PRIORDEAD
## 
##             Df Sum of Sq       RSS   AIC
## <none>                   548005046 21448
## - PRIORDEAD  1    922493 548927539 21448
## + MAGER      1    361028 547644018 21448
## + MEDUC      3     82260 547922786 21453
## - CIG_REC    1   2708955 550714001 21454
## - PRIORLIVE  4   5984714 553989760 21458
## - BMI        1   6749558 554754604 21466
## - risks      1   8234365 556239411 21471
## - DMAR       1  10970232 558975278 21479
## - MRACEHISP  3  16594210 564599256 21492

OptiReg <-Optimodel$terms
M4_optimal <- lm(OptiReg, data = data)
equatiomatic::extract_eq(M4_optimal)

\[ \operatorname{DBWT} = \alpha + \beta_{1}(\operatorname{DMAR}_{\operatorname{Unmarried}}) + \beta_{2}(\operatorname{MRACEHISP}_{\operatorname{NH\ Black}}) + \beta_{3}(\operatorname{MRACEHISP}_{\operatorname{NH\ Other}}) + \beta_{4}(\operatorname{MRACEHISP}_{\operatorname{Hispanic}}) + \beta_{5}(\operatorname{BMI}) + \beta_{6}(\operatorname{risks}_{\operatorname{Yes}}) + \beta_{7}(\operatorname{PRIORLIVE}_{\operatorname{1}}) + \beta_{8}(\operatorname{PRIORLIVE}_{\operatorname{2}}) + \beta_{9}(\operatorname{PRIORLIVE}_{\operatorname{3}}) + \beta_{10}(\operatorname{PRIORLIVE}_{\operatorname{4+}}) + \beta_{11}(\operatorname{CIG\_REC}_{\operatorname{Yes}}) + \beta_{12}(\operatorname{PRIORDEAD}_{\operatorname{1+}}) + \epsilon \]

summary(M4_optimal)

## 
## Call:
## lm(formula = OptiReg, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2751.64  -299.56    29.27   355.17  2394.55 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3080.559     64.625  47.668  < 2e-16 ***
## DMARUnmarried     -184.275     31.823  -5.791 8.36e-09 ***
## MRACEHISPNH Black -292.140     41.690  -7.007 3.51e-12 ***
## MRACEHISPNH Other -121.695     52.796  -2.305 0.021288 *  
## MRACEHISPHispanic  -74.492     38.075  -1.956 0.050577 .  
## BMI                 10.261      2.259   4.542 5.97e-06 ***
## risksYes          -158.808     31.655  -5.017 5.81e-07 ***
## PRIORLIVE1         130.686     33.814   3.865 0.000115 ***
## PRIORLIVE2         107.170     41.746   2.567 0.010339 *  
## PRIORLIVE3         138.776     59.972   2.314 0.020788 *  
## PRIORLIVE4+        131.949     62.333   2.117 0.034420 *  
## CIG_RECYes        -160.083     55.632  -2.878 0.004059 ** 
## PRIORDEAD1+        258.434    153.905   1.679 0.093304 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 572 on 1675 degrees of freedom
## Multiple R-squared:  0.1043, Adjusted R-squared:  0.09791 
## F-statistic: 16.26 on 12 and 1675 DF,  p-value: < 2.2e-16

# Comparition models
 anova(M1_null, M2_unadjusted, M3_full, M4_optimal)

# Displaying the four models by each other
modelsummary(
  list(
    null = M1_null, 
    unadjasted = M2_unadjusted,
    full = M3_full,
    optimal =  M4_optimal
    ), 
  stars = TRUE
  )

	null	unadjasted	full	optimal
(Intercept)	3234.746***	3247.903***	3162.373***	3080.559***
	(14.658)	(15.187)	(107.021)	(64.625)
CIG_RECYes		−179.096**	−156.593**	−160.083**
		(56.032)	(56.325)	(55.632)
risksYes			−152.902***	−158.808***
			(32.055)	(31.655)
MAGER			−3.836
			(3.085)
MRACEHISPNH Black			−290.175***	−292.140***
			(41.829)	(41.690)
MRACEHISPNH Other			−120.774*	−121.695*
			(52.960)	(52.796)
MRACEHISPHispanic			−72.464+	−74.492+
			(39.120)	(38.075)
DMARUnmarried			−189.782***	−184.275***
			(35.084)	(31.823)
MEDUCHS			8.448
			(49.909)
MEDUCSome college			−0.240
			(50.838)
MEDUCUni			31.610
			(56.902)
PRIORLIVE1			140.483***	130.686***
			(34.725)	(33.814)
PRIORLIVE2			123.385**	107.170*
			(43.880)	(41.746)
PRIORLIVE3			161.298*	138.776*
			(63.149)	(59.972)
PRIORLIVE4+			162.508*	131.949*
			(67.562)	(62.333)
PRIORDEAD1+			265.467+	258.434+
			(154.184)	(153.905)
BMI			10.493***	10.261***
			(2.277)	(2.259)
Num.Obs.	1688	1688	1688	1688
R2	0.000	0.006	0.105	0.104
R2 Adj.	0.000	0.005	0.097	0.098
AIC	26401.9	26393.7	26246.1	26239.9
BIC	26412.7	26410.0	26343.8	26315.9
Log.Lik.	−13198.942	−13193.843	−13105.039	−13105.946
F		10.217	12.291	16.259
RMSE	602.05	600.23	569.47	569.78
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

performance::compare_performance(M1_null,M2_unadjusted,M3_full,M4_optimal, rank = T)

Answer: We fit four regression models (null, unadjusted, full and optimal) to explain birth weight (the outcome). To compare the models we used the modelsummary function, where the standard error of each of the regressors per model is shown in brackets.

Finally, R² and AIC are metrics used to select the best model:

Akaike’s information criterion is a goodness-of-fit measure that is corrected for model complexity. A small value represents a better fit of the data.

R² is the proportion of variance shared by the observed values of an outcome and the values of the outcome predicted by a multiple regression model.

In both metrics, the optimal M4 model is the best fit.

References:
- Field, A., Miles, J., & Field, Z. (2017). Discovering statistics using R (p. 992). W. Ross MacDonald School Resource Services Library.

Exercise 4

Center and scale all the continuous variables, and create a forest plot for the optimal model.

# Scaling
natality_ctr <- data %>% 
  standardize(two_sd = TRUE)

# Fitting the model
M4_OPT_ctr <- lm(
  DBWT ~ DMAR + MRACEHISP + BMI + 
    risks + PRIORLIVE + CIG_REC + 
    PRIORDEAD, 
  data = natality_ctr
  )

# creating a forest plot
M4_OPT_ctr %>% 
  model_parameters() %>% 
  plot()

References:
- Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.

Exercise 5

How do the effect estimate of smoking, the standard error of its estimate, its 95% confidence interval, and its statistical significance change between the un-adjusted, optimal and full models? Please use the centered and scaled models to answer this question.

M2_unadjusted_ctr <- lm(DBWT ~ CIG_REC, natality_ctr)

M3_full_ctr <- lm(DBWT ~ ., natality_ctr)

M4_optimal_ctr <- lm( DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC + PRIORDEAD,
  data = natality_ctr )

M2_unadjusted_ctr %>%  tidy(conf.int = TRUE) %>%  filter(term == "CIG_RECYes")

M3_full_ctr %>% tidy(conf.int = TRUE) %>%  filter(term == "CIG_RECYes")

M4_optimal_ctr %>% tidy(conf.int = TRUE) %>%  filter(term == "CIG_RECYes")

Answer: In all cases the estimate is negative, varies little between models, and is significant in all models.

Exercise 6

Does the association between cigarette smoking during pregnancy and birth weight depend on mother’s education? Modify the FULL model appropriately to be able to answer this question. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, create a dumbell plot illustrating the smoking effect at each level of mother’s education and estimate the cigarette smoking effect at each level of mother’s education. Hint: When creating the dataset for plotting, include the value zero for each predictor not included in the interaction (the reference level for all categorical variables). Picking the value zero in a centered predictor is like choosing the mean of that variable, which is a reasonable choice.

# Full model  with interaction effect

M3_full_int_ctr <- lm(formula = DBWT ~ . + CIG_REC * MEDUC, data = natality_ctr) 
summary(M3_full_int_ctr)

## 
## Call:
## lm(formula = DBWT ~ . + CIG_REC * MEDUC, data = natality_ctr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27704 -0.24697  0.02151  0.29953  1.98393 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.06227    0.05010   1.243  0.21402    
## CIG_RECYes                   -0.01780    0.09114  -0.195  0.84520    
## risksYes                     -0.12659    0.02664  -4.752 2.18e-06 ***
## MAGER                        -0.03616    0.02977  -1.215  0.22466    
## MRACEHISPNH Black            -0.23940    0.03476  -6.888 7.98e-12 ***
## MRACEHISPNH Other            -0.10225    0.04402  -2.323  0.02030 *  
## MRACEHISPHispanic            -0.05612    0.03260  -1.721  0.08539 .  
## DMARUnmarried                -0.15687    0.02915  -5.382 8.40e-08 ***
## MEDUCHS                       0.02627    0.04459   0.589  0.55580    
## MEDUCSome college             0.02550    0.04503   0.566  0.57128    
## MEDUCUni                      0.04447    0.04899   0.908  0.36418    
## PRIORLIVE1                    0.11737    0.02884   4.070 4.93e-05 ***
## PRIORLIVE2                    0.10388    0.03646   2.850  0.00443 ** 
## PRIORLIVE3                    0.12858    0.05261   2.444  0.01463 *  
## PRIORLIVE4+                   0.13510    0.05610   2.408  0.01615 *  
## PRIORDEAD1+                   0.22467    0.12811   1.754  0.07965 .  
## BMI                           0.10834    0.02394   4.526 6.45e-06 ***
## CIG_RECYes:MEDUCHS           -0.12090    0.11702  -1.033  0.30171    
## CIG_RECYes:MEDUCSome college -0.20235    0.12165  -1.663  0.09645 .  
## CIG_RECYes:MEDUCUni          -0.06490    0.19249  -0.337  0.73605    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4752 on 1668 degrees of freedom
## Multiple R-squared:  0.1068, Adjusted R-squared:  0.09665 
## F-statistic:  10.5 on 19 and 1668 DF,  p-value: < 2.2e-16

# Comparition models
 compARE <- anova( M3_full_ctr, M3_full_int_ctr)
 

# Dumbell Plot 
# To create the dumbell figure, you will need to predict the birth weight for smoking and non-smoking mums with different education levels.
 
 
M3_full_int_ctr <- lm(formula = DBWT ~ . + CIG_REC * MEDUC, data = natality_ctr)

predictions(M3_full_int_ctr, 
            newdata = datagrid(
              model = M3_full_int_ctr,
              #MRACEHISP = c("NH White", "NH Black", "NH Other", "Hispanic"),
              MEDUC   = c("lower", "HS", "Some college", "Uni"),
              CIG_REC = c("No", "Yes")
              )
            ) %>% 
  as.tibble() %>% 
  mutate(DBWT = estimate)  %>% 
  select(DBWT, CIG_REC, MEDUC)

## Warning: `as.tibble()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` instead.
## ℹ The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    0.06227469,  0.04447809, 
  "HS",       0.08854779,  -0.05014673, 
  "Some college", 0.08854779, -0.13236808   ,
  "Uni", 0.10674030,0.02404656  ) %>%  ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education ),
    linewidth = 2) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6) +
  labs(y="Birth weight (centered & scaled)", 
         x="Smoked during pregnancy ")

Answer: After comparing two models we did not find significant statistical difference between the full model with interaction and the full model without interaction.

Exercise 7

Does the association between mother’s BMI and birth-weight differ between mothers of different race/ethnicity? Modify the OPTIMAL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect at each level of mother’s race/ethnicity and plot the BMI effect at each level of mother’s race/ethnicity using a dumbell plot.

# Full model  with interaction effect


M4_OPT__BMIxMRACEHISP_ctr <- lm(
  DBWT ~ DMAR + MRACEHISP + BMI + 
    risks + PRIORLIVE + CIG_REC + 
    PRIORDEAD + + BMI *MRACEHISP, 
  data = natality_ctr
  )
summary(M4_OPT__BMIxMRACEHISP_ctr )

## 
## Call:
## lm(formula = DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + 
##     CIG_REC + PRIORDEAD + +BMI * MRACEHISP, data = natality_ctr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25450 -0.24913  0.02481  0.30091  1.97315 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            0.10284    0.02388   4.306 1.76e-05 ***
## DMARUnmarried         -0.15462    0.02645  -5.845 6.07e-09 ***
## MRACEHISPNH Black     -0.23951    0.03478  -6.887 8.05e-12 ***
## MRACEHISPNH Other     -0.09206    0.04430  -2.078 0.037876 *  
## MRACEHISPHispanic     -0.06069    0.03165  -1.917 0.055353 .  
## BMI                    0.11999    0.03235   3.710 0.000214 ***
## risksYes              -0.13234    0.02631  -5.031 5.41e-07 ***
## PRIORLIVE1             0.10773    0.02808   3.837 0.000129 ***
## PRIORLIVE2             0.08952    0.03466   2.582 0.009894 ** 
## PRIORLIVE3             0.11519    0.04984   2.311 0.020940 *  
## PRIORLIVE4+            0.10850    0.05181   2.094 0.036406 *  
## CIG_RECYes            -0.13550    0.04621  -2.932 0.003412 ** 
## PRIORDEAD1+            0.20701    0.12791   1.618 0.105753    
## MRACEHISPNH Black:BMI -0.05032    0.05884  -0.855 0.392561    
## MRACEHISPNH Other:BMI  0.12036    0.09825   1.225 0.220711    
## MRACEHISPHispanic:BMI -0.04574    0.06290  -0.727 0.467270    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4749 on 1672 degrees of freedom
## Multiple R-squared:  0.106,  Adjusted R-squared:  0.098 
## F-statistic: 13.22 on 15 and 1672 DF,  p-value: < 2.2e-16

# Comparition models
 anova(M4_optimal, M4_OPT__BMIxMRACEHISP_ctr)

# Dumbell Plot 
# To create the dumbell figure, you will need to predict the birth weight for smoking and non-smoking mums with different education levels.
 

predictions(M4_OPT__BMIxMRACEHISP_ctr, 
            newdata = datagrid(
              model = M4_OPT__BMIxMRACEHISP_ctr,
              MRACEHISP = c("NH White", "NH Black", "NH Other", "Hispanic"),
              BMI = c("mean(BMI)", "mean(BMI) + sd(BMI)")
              )
            ) %>% 
  as_tibble() %>% 
  mutate(DBWT = estimate)  %>% 
  select(DBWT, BMI, MRACEHISP)

tribble(
  ~ MRACEHISP, ~mean_BMI, ~meansd_BMI,
  "NH White",  0.103,  0.232 ,
  "NH Black",  -0.137,  -0.0670,
  "NH Other",   0.0108, 0.251   ,
  "Hispanic",   0.0422,0.116    ) %>%  ggplot() +
  geom_segment(
    aes(
      x = "mean(BMI)", y =mean_BMI,
      xend = "mean(BMI) + sd(BMI)", yend = meansd_BMI,
      color = MRACEHISP ),
    linewidth = 2) +
  geom_point(
    aes(x = "mean(BMI)", y = mean_BMI, color = MRACEHISP),
    size = 6) +
  geom_point(
    aes(x = "mean(BMI) + sd(BMI)", y = meansd_BMI, color = MRACEHISP),
    size = 6) +
  labs(y="Birth weight (centered & scaled)",
         x="BMI ")

Answer: Comparing with children from White mothers, the birth weight is lower in children from Black mothers; comparing with children from White mothers, the birth weight is lower in children from Hispanic mothers; Comparing with children from White mothers, the birth weight is higher in children from mothers other ethnicity. However, there is no significant difference between the birth weight of children with White mothers and the birth weight of children from mothers with Black, Hispanic and other ethnicity.

Exercise 8

Does the association between mother’s BMI and birth-weight depend on mother’s age? Modify the FULL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect for mothers age 20, 30, and 40 years and make a plot illustrating the BMI effect at each of these ages.

# Full model  with interaction effect

M3_full_BMIxMAGER_ctr <- lm(formula = DBWT ~ . + BMI * MAGER, data = natality_ctr) 
summary(M3_full_BMIxMAGER_ctr)

## 
## Call:
## lm(formula = DBWT ~ . + BMI * MAGER, data = natality_ctr)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27538 -0.25308  0.02022  0.29794  1.98582 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.0830972  0.0483016   1.720  0.08555 .  
## CIG_RECYes        -0.1308047  0.0467996  -2.795  0.00525 ** 
## risksYes          -0.1266568  0.0266254  -4.757 2.13e-06 ***
## MAGER             -0.0370333  0.0297274  -1.246  0.21303    
## MRACEHISPNH Black -0.2397811  0.0348058  -6.889 7.93e-12 ***
## MRACEHISPNH Other -0.1008122  0.0439924  -2.292  0.02205 *  
## MRACEHISPHispanic -0.0593407  0.0325253  -1.824  0.06826 .  
## DMARUnmarried     -0.1577950  0.0291384  -5.415 7.01e-08 ***
## MEDUCHS            0.0067930  0.0414481   0.164  0.86984    
## MEDUCSome college -0.0004223  0.0422197  -0.010  0.99202    
## MEDUCUni           0.0251011  0.0473046   0.531  0.59575    
## PRIORLIVE1         0.1162589  0.0288463   4.030 5.82e-05 ***
## PRIORLIVE2         0.1016648  0.0364701   2.788  0.00537 ** 
## PRIORLIVE3         0.1339035  0.0524410   2.553  0.01076 *  
## PRIORLIVE4+        0.1352514  0.0561094   2.410  0.01604 *  
## PRIORDEAD1+        0.2172404  0.1281854   1.695  0.09031 .  
## BMI                0.1099342  0.0239239   4.595 4.65e-06 ***
## MAGER:BMI         -0.0242484  0.0469027  -0.517  0.60523    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4753 on 1670 degrees of freedom
## Multiple R-squared:  0.1054, Adjusted R-squared:  0.09633 
## F-statistic: 11.58 on 17 and 1670 DF,  p-value: < 2.2e-16

# Comparition models
 anova( M4_optimal, M3_full_BMIxMAGER_ctr)

# Dumbell Plot 
# To create the dumbell figure, you will need to predict the birth weight for smoking and non-smoking mums with different education levels.
 
 
M3_full_BMIxMAGER_ctr <- lm(formula = DBWT ~ . + BMI * MAGER, data = natality_ctr)

predictions(M3_full_BMIxMAGER_ctr, 
            newdata = datagrid(
              model = M3_full_BMIxMAGER_ctr,
              BMI = c("mean(BMI)", "mean(BMI) + sd(BMI)"),
              MAGER   = c(20,30,40))
            ) %>% 
  as.tibble() %>% 
  mutate(DBWT = estimate)  %>% 
  select(DBWT,BMI, MAGER)

tribble(
  ~Mother_age, ~No_Smoking, ~Smoking,
  "20",   -0.632,  -1.01,
  "30",   -1.00,  -1.62,
  "40",   -1.37, -2.23) %>%  ggplot() +
  geom_segment(
    aes(
      x = "mean(BMI)", y = No_Smoking,
      xend = "mean(BMI) + sd(BMI)", yend = Smoking,
      color = Mother_age ),
    linewidth = 2) +
  geom_point(
    aes(x = "mean(BMI)", y = No_Smoking, color = Mother_age),
    size = 6) +
  geom_point(
    aes(x = "mean(BMI) + sd(BMI)", y = Smoking, color = Mother_age),
    size = 6) +
  labs(y="Birth weight (centered & scaled)",
         x="Mother´s age ")

Answer: The association between mother´s BMI and birth-weight did not depended on mother´age. For the model´s perfomance, there is no significant differences between the full model with interaction and the full model without interaction.