Lab 5: Multiple Linear Regression

Exercise 1

Question: Use the table1 package to create a table with appropriate descriptive information about the data set. Use the examples provided in the following vignette to help you build a table. What do you notice from this table? What research questions might it raise?

library(tidyverse)
library(broom)
library(table1)
library(car)
library(GGally)
library(easystats)
library(datawizard)
library(marginaleffects)
library(modelsummary)

natality <- read_rds("data/natality.rds")

glimpse(natality)

## Rows: 1,688
## Columns: 10
## $ DBWT      <labelled> 3289, 907, 3447, 3030, 3441, 2950, 2812, 4210, 2810, 36…
## $ CIG_REC   <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
## $ risks     <fct> Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, Yes, N…
## $ MAGER     <labelled> 35, 28, 22, 19, 30, 20, 41, 34, 35, 28, 36, 27, 24, 33,…
## $ MRACEHISP <fct> Hispanic, NH White, NH Black, Hispanic, NH White, NH Other, …
## $ DMAR      <fct> Married, Unmarried, Unmarried, Unmarried, Married, Unmarried…
## $ MEDUC     <fct> Uni, lower, HS, HS, Uni, Some college, Some college, Uni, So…
## $ PRIORLIVE <fct> 1, 2, 0, 1, 1, 0, 0, 1, 1, 2, 0, 0, 1, 3, 0, 3, 1, 1, 4+, 3,…
## $ PRIORDEAD <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ BMI       <labelled> 24.9, 27.9, 26.8, 23.6, 33.3, 33.3, 29.9, 25.1, 22.3, 1…

label(natality$DBWT) <- "Birth Weight (tones)"
table1(~DBWT +risks + MAGER + MRACEHISP+DMAR+MEDUC+PRIORLIVE+PRIORDEAD+BMI| CIG_REC, data = natality, overall = FALSE)

	No (N=1564)	Yes (N=124)
Birth Weight (tones)
Mean (SD)	3250 (601)	3070 (597)
Median [Min, Max]	3290 [369, 5880]	3090 [630, 4220]
Risk Factors Reported
No	1088 (69.6%)	84 (67.7%)
Yes	476 (30.4%)	40 (32.3%)
Mother Age
Mean (SD)	29.0 (5.79)	27.8 (5.79)
Median [Min, Max]	29.0 [15.0, 49.0]	28.0 [17.0, 42.0]
Mother Race/Hispanic Origin
NH White	844 (54.0%)	94 (75.8%)
NH Black	266 (17.0%)	14 (11.3%)
NH Other	125 (8.0%)	10 (8.1%)
Hispanic	329 (21.0%)	6 (4.8%)
Marital Status
Married	1017 (65.0%)	40 (32.3%)
Unmarried	547 (35.0%)	84 (67.7%)
Mother Education
lower	177 (11.3%)	34 (27.4%)
HS	377 (24.1%)	45 (36.3%)
Some college	437 (27.9%)	37 (29.8%)
Uni	573 (36.6%)	8 (6.5%)
PRIORLIVE
0	601 (38.4%)	38 (30.6%)
1	516 (33.0%)	41 (33.1%)
2	265 (16.9%)	17 (13.7%)
3	93 (5.9%)	17 (13.7%)
4+	89 (5.7%)	11 (8.9%)
PRIORDEAD
0	1551 (99.2%)	123 (99.2%)
1+	13 (0.8%)	1 (0.8%)
Mother BMI
Mean (SD)	26.9 (6.32)	27.3 (6.42)
Median [Min, Max]	25.6 [15.6, 58.4]	26.3 [16.5, 50.1]

Answer: The table shows the descriptive statistics stratified by smoking status during pregnancy. There’s a clear trend of lower birth weights among infants born to mothers who reported smoking during pregnancy compared to those who did not. This aligns with established findings on the negative impact of smoking on fetal development and reduced birth weight. Furthermore, there are consistent differences in demographic characteristics between the two groups, with higher proportions of younger, unmarried, and less educated mothers among smokers. This suggests potential socioeconomic disparities in smoking behavior during pregnancy.

There appears to be another trend where smokers tend to have fewer prior live births and dead births compared to non-smokers. This could suggest potential differences in reproductive history between the two groups, possibly influenced by lifestyle factors such as smoking.

As for BMI, there is a slight variation between smokers and non-smokers, with smokers tending to have a slightly lower mean BMI. This might reflect differences in health behaviors and metabolic profiles between the two groups, warranting further investigation into the relationship between smoking status and body weight during pregnancy.

All these observations can lead us to several possible questions:**

How does smoking during pregnancy impact birth weight, and what are the potential mediating factors such as maternal age, race/ethnicity, and education level?
Can socioeconomic factors such as marital status and education level predict smoking behavior during pregnancy, and how do these factors interact with other variables such as maternal age and race/ethnicity?
Are there significant differences in the number of prior live births and dead births between smokers and non-smokers, and how do these differences relate to birth outcomes?
Are there significant differences in the number of prior live births and dead births between smokers and non-smokers, and how do these differences relate to birth outcomes?

Exercise 2

Question: Use the ggpairs function to explore the association between the different variables. Try to replicate the figure below, or create your own variation. What do you see in the figure? What new information does it provide? What are the advantages/disadvantages when comparing table1 to ggpairs?

# Write your code here
ggpairs(natality, c(1,3,4,6,10),
        mapping=aes(color=CIG_REC), 
        upper=list(continuous=wrap("density", alpha=0.9),combo="facethist",discrete=wrap('colbar')), 
        lower=list(continuous=wrap("cor"), combo=NULL), 
        diag=list(continuous=wrap("densityDiag", color=NA, alpa=0.6), discrete=wrap('densityDiag')), 
        legend=c(3,5),
        #columnLabels= labels, 
        title='Natality in the United States(2018)',
        showStrips=FALSE,
        cardinality_threschold=5)+ labs(colors='Smoking')+theme_minimal()+theme(legend.position='bottom')

Answer: The use of the ggpairs function gives a great visualization of the associations between different variables in our data set and it even includes the correlation coefficients. This visual presentation can easily reveal patterns, trends and potential outliers without having to create a large amount of different graphs separately from each other.

Advantages:

Utilizing the visual representation provided by the ggpairs function enables good comprehension of relationships between variables. This can facilitate the detection of outliers and nuanced patterns that might escape notice within tabular summaries. Additionally, scatterplots are particularly useful at revealing non-linear relationships, offering a better insight into complex interactions among our variables.

Disadvantages:

Using visual representations can make comparing specific values between variables a lot more challenging than with tabular formats. Moreover, interpretation may become subjective, particularly when confronted with a large number of variables, potentially leading to varying interpretations of the data. Furthermore, there’s more constraint in customizing the output compared to tabular formats, which offer more flexibility in presenting precise data points and summaries.

Exercise 3

Question: Fit four linear models, each nested within the other: the null model, the unadjusted model, the full model and the optimal model. Use the step function find the optimal model using the stepwise method. Then use anova to compare the four models. What is your conclusion? Now compare your models using modelsummary function. What does the number in the (brackets) mean? How can you identify the optimal model just by looking at the R^2 and the AIC?

M.NULL <- lm(DBWT ~ 1, natality)
M.UNADJUSTED <- lm(DBWT ~ CIG_REC, natality)

# The dot represents all variables except the outcome DBWT
M.FULL <- lm(DBWT ~ ., natality)  

# Finding the optimal model
step(M.NULL,
     scope = list(upper = M.FULL),
     direction="both",
     data=natality)

## Start:  AIC=21609.55
## DBWT ~ 1
## 
##             Df Sum of Sq       RSS   AIC
## + DMAR       1  28672236 583166851 21530
## + MRACEHISP  3  28557357 583281730 21535
## + PRIORLIVE  4   7485192 604353895 21597
## + MEDUC      3   5189885 606649203 21601
## + CIG_REC    1   3685187 608153901 21601
## + risks      1   3277949 608561138 21602
## + BMI        1   3170730 608668358 21603
## + MAGER      1   2941408 608897679 21603
## + PRIORDEAD  1    849628 610989460 21609
## <none>                   611839087 21610
## 
## Step:  AIC=21530.53
## DBWT ~ DMAR
## 
##             Df Sum of Sq       RSS   AIC
## + MRACEHISP  3  15359037 567807814 21492
## + risks      1   4395289 578771562 21520
## + BMI        1   3607980 579558871 21522
## + PRIORLIVE  4   4500359 578666492 21526
## + CIG_REC    1    978350 582188501 21530
## + PRIORDEAD  1    818812 582348039 21530
## <none>                   583166851 21530
## + MAGER      1     82475 583084376 21532
## + MEDUC      3    313468 582853384 21536
## - DMAR       1  28672236 611839087 21610
## 
## Step:  AIC=21491.48
## DBWT ~ DMAR + MRACEHISP
## 
##             Df Sum of Sq       RSS   AIC
## + BMI        1   4532441 563275373 21480
## + risks      1   3924653 563883161 21482
## + CIG_REC    1   2181695 565626119 21487
## + PRIORLIVE  4   3880005 563927809 21488
## + PRIORDEAD  1    846700 566961114 21491
## <none>                   567807814 21492
## + MAGER      1     52434 567755381 21493
## + MEDUC      3     98014 567709800 21497
## - MRACEHISP  3  15359037 583166851 21530
## - DMAR       1  15473916 583281730 21535
## 
## Step:  AIC=21479.95
## DBWT ~ DMAR + MRACEHISP + BMI
## 
##             Df Sum of Sq       RSS   AIC
## + risks      1   6059709 557215664 21464
## + CIG_REC    1   2362075 560913299 21475
## + PRIORLIVE  4   3457999 559817375 21478
## + PRIORDEAD  1    896345 562379028 21479
## <none>                   563275373 21480
## + MAGER      1    125890 563149483 21482
## + MEDUC      3    115009 563160364 21486
## - BMI        1   4532441 567807814 21492
## - MRACEHISP  3  16283498 579558871 21522
## - DMAR       1  15325756 578601129 21523
## 
## Step:  AIC=21463.69
## DBWT ~ DMAR + MRACEHISP + BMI + risks
## 
##             Df Sum of Sq       RSS   AIC
## + PRIORLIVE  4   5580134 551635530 21455
## + CIG_REC    1   2199370 555016294 21459
## + PRIORDEAD  1   1025923 556189741 21463
## <none>                   557215664 21464
## + MAGER      1     22588 557193076 21466
## + MEDUC      3     69675 557145989 21470
## - risks      1   6059709 563275373 21480
## - BMI        1   6667497 563883161 21482
## - MRACEHISP  3  15921403 573137067 21505
## - DMAR       1  16431163 573646827 21511
## 
## Step:  AIC=21454.7
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE
## 
##             Df Sum of Sq       RSS   AIC
## + CIG_REC    1   2707991 548927539 21448
## + PRIORDEAD  1    921529 550714001 21454
## <none>                   551635530 21455
## + MAGER      1    280671 551354859 21456
## + MEDUC      3    173640 551461890 21460
## - PRIORLIVE  4   5580134 557215664 21464
## - BMI        1   6486820 558122350 21472
## - risks      1   8181845 559817375 21478
## - MRACEHISP  3  15145908 566781438 21494
## - DMAR       1  14456460 566091990 21496
## 
## Step:  AIC=21448.4
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC
## 
##             Df Sum of Sq       RSS   AIC
## + PRIORDEAD  1    922493 548005046 21448
## <none>                   548927539 21448
## + MAGER      1    337056 548590483 21449
## + MEDUC      3     75431 548852108 21454
## - CIG_REC    1   2707991 551635530 21455
## - PRIORLIVE  4   6088755 555016294 21459
## - BMI        1   6662694 555590232 21467
## - risks      1   8131760 557059299 21471
## - DMAR       1  10955889 559883427 21480
## - MRACEHISP  3  16576572 565504111 21493
## 
## Step:  AIC=21447.56
## DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC + 
##     PRIORDEAD
## 
##             Df Sum of Sq       RSS   AIC
## <none>                   548005046 21448
## - PRIORDEAD  1    922493 548927539 21448
## + MAGER      1    361028 547644018 21448
## + MEDUC      3     82260 547922786 21453
## - CIG_REC    1   2708955 550714001 21454
## - PRIORLIVE  4   5984714 553989760 21458
## - BMI        1   6749558 554754604 21466
## - risks      1   8234365 556239411 21471
## - DMAR       1  10970232 558975278 21479
## - MRACEHISP  3  16594210 564599256 21492

## 
## Call:
## lm(formula = DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + 
##     CIG_REC + PRIORDEAD, data = natality)
## 
## Coefficients:
##       (Intercept)      DMARUnmarried  MRACEHISPNH Black  MRACEHISPNH Other  
##           3080.56            -184.27            -292.14            -121.70  
## MRACEHISPHispanic                BMI           risksYes         PRIORLIVE1  
##            -74.49              10.26            -158.81             130.69  
##        PRIORLIVE2         PRIORLIVE3        PRIORLIVE4+         CIG_RECYes  
##            107.17             138.78             131.95            -160.08  
##       PRIORDEAD1+  
##            258.43

# Displaying the four models by each other
M.OPT <- lm(DBWT ~ DMAR + MRACEHISP + BMI + risks + PRIORLIVE + CIG_REC + PRIORDEAD, natality)

modelsummary(
  list(
    null = M.NULL, 
    unadjasted = M.UNADJUSTED, 
    OPT = M.OPT , 
    Full = M.FULL
    ), 
  stars = TRUE )

	null	unadjasted	OPT	Full
(Intercept)	3234.746***	3247.903***	3080.559***	3162.373***
	(14.658)	(15.187)	(64.625)	(107.021)
CIG_RECYes		−179.096**	−160.083**	−156.593**
		(56.032)	(55.632)	(56.325)
DMARUnmarried			−184.275***	−189.782***
			(31.823)	(35.084)
MRACEHISPNH Black			−292.140***	−290.175***
			(41.690)	(41.829)
MRACEHISPNH Other			−121.695*	−120.774*
			(52.796)	(52.960)
MRACEHISPHispanic			−74.492+	−72.464+
			(38.075)	(39.120)
BMI			10.261***	10.493***
			(2.259)	(2.277)
risksYes			−158.808***	−152.902***
			(31.655)	(32.055)
PRIORLIVE1			130.686***	140.483***
			(33.814)	(34.725)
PRIORLIVE2			107.170*	123.385**
			(41.746)	(43.880)
PRIORLIVE3			138.776*	161.298*
			(59.972)	(63.149)
PRIORLIVE4+			131.949*	162.508*
			(62.333)	(67.562)
PRIORDEAD1+			258.434+	265.467+
			(153.905)	(154.184)
MAGER				−3.836
				(3.085)
MEDUCHS				8.448
				(49.909)
MEDUCSome college				−0.240
				(50.838)
MEDUCUni				31.610
				(56.902)
Num.Obs.	1688	1688	1688	1688
R2	0.000	0.006	0.104	0.105
R2 Adj.	0.000	0.005	0.098	0.097
AIC	26401.9	26393.7	26239.9	26246.1
BIC	26412.7	26410.0	26315.9	26343.8
Log.Lik.	−13198.942	−13193.843	−13105.946	−13105.039
F		10.217	16.259	12.291
RMSE	602.05	600.23	569.78	569.47
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Answer: We can identify the optimal model by looking at the R2 and the AIC because a higher R² suggests a better fit of the model to the data, while a lower AIC indicates a better balance between model fit and complexity. Therefore, to identify the optimal model using R² and AIC, we should look for the model with the highest R² and lowest AIC. In our case, the optimal model seems to be the “OPT” model, as it has the highest R² (0.104) and lowest AIC (26239.9) among the compared models.

Furthermore, the numbers in the brackets following the coefficient estimates represent the SEs of the coefficients. Smaller values indicate greater precision in the estimation of the coefficients, suggesting higher confidence in the parameter estimates.

Exercise 4:

Question: Center and scale all the continuous variables, and create a forest plot for the optimal model.

# Scaling
natality.ctr <- natality |> 
  standardize(two_sd = TRUE)

# Fitting the model
M.OPT.ctr <- lm(
  DBWT ~ DMAR + MRACEHISP + BMI + 
    risks + PRIORLIVE + CIG_REC + 
    PRIORDEAD, 
  data = natality.ctr
  )

# creating a forest plot
M.OPT.ctr %>% 
  model_parameters() %>% 
  plot()

Exercise 5:

Question:How do the effect estimate of smoking, the standard error of its estimate, its 95% confidence interval, and its statistical significance change between the un-adjusted, optimal and full models? Please use the centered and scaled models to answer this question.

M.UNADJUSTED.ctr <- lm(DBWT ~ CIG_REC, natality.ctr)
M.FULL.ctr <- lm(DBWT ~ ., natality.ctr) 

M.UNADJUSTED.ctr |> 
  tidy(conf.int = TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)   0.0109    0.0126     0.866 0.386    -0.0138    0.0357
## 2 CIG_RECYes   -0.149     0.0465    -3.20  0.00142  -0.240    -0.0575

M.OPT.ctr |> 
  tidy(conf.int = TRUE)

## # A tibble: 13 × 7
##    term              estimate std.error statistic  p.value conf.low conf.high
##    <chr>                <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 (Intercept)         0.101     0.0238      4.25 2.24e- 5  0.0546   0.148   
##  2 DMARUnmarried      -0.153     0.0264     -5.79 8.36e- 9 -0.205   -0.101   
##  3 MRACEHISPNH Black  -0.243     0.0346     -7.01 3.51e-12 -0.310   -0.175   
##  4 MRACEHISPNH Other  -0.101     0.0438     -2.30 2.13e- 2 -0.187   -0.0151  
##  5 MRACEHISPHispanic  -0.0618    0.0316     -1.96 5.06e- 2 -0.124    0.000156
##  6 BMI                 0.108     0.0237      4.54 5.97e- 6  0.0612   0.154   
##  7 risksYes           -0.132     0.0263     -5.02 5.81e- 7 -0.183   -0.0803  
##  8 PRIORLIVE1          0.109     0.0281      3.86 1.15e- 4  0.0534   0.164   
##  9 PRIORLIVE2          0.0890    0.0347      2.57 1.03e- 2  0.0210   0.157   
## 10 PRIORLIVE3          0.115     0.0498      2.31 2.08e- 2  0.0176   0.213   
## 11 PRIORLIVE4+         0.110     0.0518      2.12 3.44e- 2  0.00805  0.211   
## 12 CIG_RECYes         -0.133     0.0462     -2.88 4.06e- 3 -0.224   -0.0423  
## 13 PRIORDEAD1+         0.215     0.128       1.68 9.33e- 2 -0.0361   0.465

modelsummary(
  list(
    unadjasted = M.UNADJUSTED.ctr, 
    full = M.FULL.ctr, 
    opt = M.OPT.ctr
    ), 
  stars = TRUE
  )

	unadjasted	full	opt
(Intercept)	0.011	0.082+	0.101***
	(0.013)	(0.048)	(0.024)
CIG_RECYes	−0.149**	−0.130**	−0.133**
	(0.047)	(0.047)	(0.046)
risksYes		−0.127***	−0.132***
		(0.027)	(0.026)
MAGER		−0.037
		(0.030)
MRACEHISPNH Black		−0.241***	−0.243***
		(0.035)	(0.035)
MRACEHISPNH Other		−0.100*	−0.101*
		(0.044)	(0.044)
MRACEHISPHispanic		−0.060+	−0.062+
		(0.032)	(0.032)
DMARUnmarried		−0.158***	−0.153***
		(0.029)	(0.026)
MEDUCHS		0.007
		(0.041)
MEDUCSome college		0.000
		(0.042)
MEDUCUni		0.026
		(0.047)
PRIORLIVE1		0.117***	0.109***
		(0.029)	(0.028)
PRIORLIVE2		0.102**	0.089*
		(0.036)	(0.035)
PRIORLIVE3		0.134*	0.115*
		(0.052)	(0.050)
PRIORLIVE4+		0.135*	0.110*
		(0.056)	(0.052)
PRIORDEAD1+		0.220+	0.215+
		(0.128)	(0.128)
BMI		0.110***	0.108***
		(0.024)	(0.024)
Num.Obs.	1688	1688	1688
R2	0.006	0.105	0.104
R2 Adj.	0.005	0.097	0.098
AIC	2445.1	2297.5	2291.3
BIC	2461.4	2395.2	2367.3
Log.Lik.	−1219.537	−1130.732	−1131.640
F	10.217	12.291	16.259
RMSE	0.50	0.47	0.47
+ p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

results_df <- data.frame(
  Model = c("M.FULL.ctr", "M.OPT.ctr", "M.UNADJUSTED.ctr"),
  CIG_RECYes = c(coef(M.FULL.ctr)["CIG_RECYes"], coef(M.OPT.ctr)["CIG_RECYes"], coef(M.UNADJUSTED.ctr)["CIG_RECYes"]),
  std.error = c(summary(M.FULL.ctr)$coef["CIG_RECYes", "Std. Error"],
                summary(M.OPT.ctr)$coef["CIG_RECYes", "Std. Error"],
                summary(M.UNADJUSTED.ctr)$coef["CIG_RECYes", "Std. Error"]),
  statistic = c(summary(M.FULL.ctr)$coef["CIG_RECYes", "t value"],
                summary(M.OPT.ctr)$coef["CIG_RECYes", "t value"],
                summary(M.UNADJUSTED.ctr)$coef["CIG_RECYes", "t value"]),
  p.value = c(summary(M.FULL.ctr)$coef["CIG_RECYes", "Pr(>|t|)"],
               summary(M.OPT.ctr)$coef["CIG_RECYes", "Pr(>|t|)"],
               summary(M.UNADJUSTED.ctr)$coef["CIG_RECYes", "Pr(>|t|)"]),
  conf.low = c(confint(M.FULL.ctr)["CIG_RECYes", "2.5 %"],
               confint(M.OPT.ctr)["CIG_RECYes", "2.5 %"],
               confint(M.UNADJUSTED.ctr)["CIG_RECYes", "2.5 %"]),
  conf.high = c(confint(M.FULL.ctr)["CIG_RECYes", "97.5 %"],
                confint(M.OPT.ctr)["CIG_RECYes", "97.5 %"],
                confint(M.UNADJUSTED.ctr)["CIG_RECYes", "97.5 %"])
)

# Print the table using kable
library(knitr)
kable(results_df, format = "markdown")

Model	CIG_RECYes	std.error	statistic	p.value	conf.low	conf.high
M.FULL.ctr	-0.1300115	0.0467642	-2.780148	0.0054943	-0.2217341	-0.0382889
M.OPT.ctr	-0.1329085	0.0461888	-2.877505	0.0040591	-0.2235024	-0.0423147
M.UNADJUSTED.ctr	-0.1486948	0.0465204	-3.196331	0.0014177	-0.2399387	-0.0574509

Answer: 1. We observe that the effect estimate of smoking is negative in all models. The effect estimate decreases slightly from the unadjusted to the full model.

The standard error remains relatively consistent across the models
The confidence intervals overlap between the models, indicating no significant difference in the precision of the estimate.
The p-values decrease from the unadjusted to the full model, indicating increasing statistical significance.

While the effect estimate of smoking slightly decreases from the unadjusted to the full model, the standard error remains consistent. The confidence intervals overlap between the models, suggesting no significant difference in precision. However, the statistical significance increases, as evidenced by decreasing p-values, indicating stronger evidence against the null hypothesis in the full model compared to the unadjusted model

Exercise 6:

Question: Does the association between cigarette smoking during pregnancy and birth weight depend on mother’s education? Modify the FULL model appropriately to be able to answer this question. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, create ONE of the dumbell plots shown below, illustrating the smoking effect at each level of mother’s education and estimate the cigarette smoking effect at each level of mother’s education. Hint: We will be using the marginaleffects package to create the plots.

  # Write your code here
M.FULL_int.ctr <- lm(
  formula = DBWT ~ . + CIG_REC * MEDUC,
  data = natality.ctr
)

library(marginaleffects)
M.FULL_int.ctr <- lm(formula = DBWT ~ . + CIG_REC * MEDUC, data = natality.ctr)

predictions(M.FULL_int.ctr, 
            newdata = datagrid(
              model = M.FULL_int.ctr,
              MRACEHISP = c("NH White", "NH Black", "NH Other", "Hispanic"),
              MEDUC   = c("lower", "HS", "Some college", "Uni"),
              CIG_REC = c("No", "Yes")
              )
            ) |> 
  as.data.frame() |> 
  mutate(DBWT = estimate)  |> 
  select(DBWT, MRACEHISP, CIG_REC, MEDUC)

##            DBWT MRACEHISP CIG_REC        MEDUC
## 1   0.062274688  NH White      No        lower
## 2   0.044478085  NH White     Yes        lower
## 3   0.088547793  NH White      No           HS
## 4  -0.050146730  NH White     Yes           HS
## 5   0.087773609  NH White      No Some college
## 6  -0.132368083  NH White     Yes Some college
## 7   0.106740302  NH White      No          Uni
## 8   0.024046565  NH White     Yes          Uni
## 9  -0.177128407  NH Black      No        lower
## 10 -0.194925010  NH Black     Yes        lower
## 11 -0.150855303  NH Black      No           HS
## 12 -0.289549825  NH Black     Yes           HS
## 13 -0.151629486  NH Black      No Some college
## 14 -0.371771178  NH Black     Yes Some college
## 15 -0.132662793  NH Black      No          Uni
## 16 -0.215356530  NH Black     Yes          Uni
## 17 -0.039972642  NH Other      No        lower
## 18 -0.057769244  NH Other     Yes        lower
## 19 -0.013699537  NH Other      No           HS
## 20 -0.152394060  NH Other     Yes           HS
## 21 -0.014473721  NH Other      No Some college
## 22 -0.234615412  NH Other     Yes Some college
## 23  0.004492972  NH Other      No          Uni
## 24 -0.078200765  NH Other     Yes          Uni
## 25  0.006155744  Hispanic      No        lower
## 26 -0.011640859  Hispanic     Yes        lower
## 27  0.032428848  Hispanic      No           HS
## 28 -0.106265674  Hispanic     Yes           HS
## 29  0.031654665  Hispanic      No Some college
## 30 -0.188487027  Hispanic     Yes Some college
## 31  0.050621358  Hispanic      No          Uni
## 32 -0.032072379  Hispanic     Yes          Uni

#NH WHITE 
tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    0.06227469,  0.04447809, 
  "HS",       0.08854779,  -0.05014673, 
 "Some college",0.087773609, -0.132368083, 
 "Uni", 0.106740302, 0.024046565, 
  ) |> ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education
      ),
    linewidth = 2
    ) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6
    ) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6
    )

#NH BLACK 
tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    -0.177128407,  -0.194925010, 
  "HS",       -0.150855303,  -0.289549825, 
 "Some college",-0.151629486, -0.132368083, 
 "Uni", -0.132662793,-0.215356530, 
  ) |> ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education
      ),
    linewidth = 2
    ) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6
    ) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6
    )

#NH HISPANIC 
tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    0.006155744,  -0.011640859, 
  "HS",       0.032428848,  -0.106265674, 
 "Some college",0.031654665, -0.188487027, 
 "Uni", 0.050621358,-0.032072379, 
  ) |> ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education
      ),
    linewidth = 2
    ) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6
    ) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6
    )

#NH OTHER
tribble(
  ~Education, ~No_Smoking, ~Smoking, 
  "Lower",    -0.039972642,  -0.057769244, 
  "HS",       -0.013699537,  -0.152394060, 
 "Some college",-0.014473721, -0.234615412, 
 "Uni", 0.004492972,-0.078200765, 
  ) |> ggplot() + 
  geom_segment(
    aes(
      x = "No Smoking", y = No_Smoking, 
      xend = "Smoking", yend = Smoking, 
      color = Education
      ),
    linewidth = 2
    ) + 
  geom_point(
    aes(x = "No Smoking", y = No_Smoking, color = Education), 
    size = 6
    ) + 
  geom_point(
    aes(x = "Smoking", y = Smoking, color = Education), 
    size = 6
    )

Answer: Based on the graphs it seems that for almost every education level, smoking decreases the birth weight of the child. This is the effect we would expect. However, it seems that for some groups (lower education white moms, uni level ‘other’ moms, and HS level Hispanic moms), smoking actually seems to increase the baby’s birth weight. This is contrary to the trend we expect. This may be a case of a confounding factor that is masking the true relationship. For example, it could be possible that among ‘other’ moms, those who smoke and go to Uni are more likely to be in a higher social class, so their baby’s health may overall be higher than those who do not get education and can’t afford to smoke

Exercise 7:

Does the association between mother’s BMI and birth-weight differ between mothers of different race/ethnicity? Modify the OPTIMAL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect at each level of mother’s race/ethnicity and plot the BMI effect at each level of mother’s race/ethnicity using a dumbbell plot. Provide a description, interpretation or reflection about what you see in the plot. What do you notice? what do you wonder? What research questions could your results prompt?

library(marginaleffects)

# Fit the model with the interaction term between BMI and MRACEHISP
model <- lm(DBWT ~ BMI * MRACEHISP, data = natality.ctr)

# Create a dataset with combinations of BMI and MRACEHISP
datagrid <- expand.grid(
  BMI = seq(min(natality.ctr$BMI), max(natality.ctr$BMI), length.out = 100),
  MRACEHISP = unique(natality.ctr$MRACEHISP)
)

# Predict birth weight for each combination of BMI and MRACEHISP
predictions <- predictions(model, newdata = datagrid)

# Calculate the BMI effect for each level of MRACEHISP
# This is done by subtracting the predicted birth weight when BMI increases by 1 unit
# from the predicted birth weight when BMI stays the same
effect_estimates <- predictions %>%
  group_by(MRACEHISP) %>%
  summarize(BMI_effect = diff(estimate)[1])

# Print the estimated BMI effects for each level of MRACEHISP
print(effect_estimates)

## # A tibble: 4 × 2
##   MRACEHISP BMI_effect
##   <fct>          <dbl>
## 1 NH White     0.00312
## 2 NH Black     0.00239
## 3 NH Other     0.00725
## 4 Hispanic     0.00196

# Plot the BMI effect at each level of MRACEHISP using a dumbbell plot
ggplot(effect_estimates, aes(x = BMI_effect, y = MRACEHISP)) +
  geom_point(size = 3, color = "blue") +
  geom_segment(aes(x = 0, xend = BMI_effect, y = MRACEHISP, yend = MRACEHISP), color = "red") +
  labs(title = "BMI Effect on Birth Weight by Race/Ethnicity",
       x = "BMI Effect",
       y = "Race/Ethnicity") +
  theme_minimal()

Answer: The BMI effect appears to be greatest in the group of “other” in the mother’s race. This is interesting, as the exact ethnic makeup of this group is unknown. It may be that within these groups there are some observations skewing the data.

Aside from the group ‘other’, the order of the coefficients is white, black, and then hispanic. This indicates that among white mothers, an increase in the mother’s BMI has a proportionately greater impact on the birth weight of the child than a similar BMI increase does among mothers of other races. The reasons for this are unknown, but there could be some interaction with other racial differences (genetics, obesity rate, differences in metabolism, gestation, etc.)

Exercise 8:

Does the association between mother’s BMI and birth-weight depend on mother’s age? Modify the FULL model appropriately to be able to answer this question, using the dataset with the centered variables. Use anova to test whether the model with the interaction is significantly better than the one without it. Regardless of statistical significance, estimate the BMI effect for mothers age 20, 30, and 40 years and make a plot illustrating the BMI effect at each of these ages. What do you notice? what do you wonder? What research questions could your results prompt?

# Write your code here

# Write your code here
# 1. Modify the Full Model
model_interaction <- lm(DBWT ~ BMI * MAGER, data = natality.ctr)

# 2. Perform ANOVA
model_without_interaction <- lm(DBWT ~ BMI + MAGER, data = natality.ctr)
anova_result <- anova(model_interaction, model_without_interaction)
print(anova_result)

## Analysis of Variance Table
## 
## Model 1: DBWT ~ BMI * MAGER
## Model 2: DBWT ~ BMI + MAGER
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1684 417.30                           
## 2   1685 417.72 -1  -0.42146 1.7008 0.1924

# 3. Estimate BMI Effect for Specific Ages (20, 30, 40)
ages <- c(20, 30, 40)
BMI_effects <- sapply(ages, function(age) {
  predict(model_interaction, newdata = data.frame(BMI = mean(natality.ctr$BMI), MAGER = age))
})

# 4. Create a Plot
library(ggplot2)
age_effect_plot <- data.frame(Age = ages, BMI_Effect = BMI_effects)
ggplot(data = age_effect_plot, aes(x = Age, y = BMI_Effect)) +
  geom_point(size = 3, color = "blue") +
  geom_line(color = "red") +
  labs(title = "BMI Effect on Birth Weight at Different Ages",
       x = "Mother's Age",
       y = "BMI Effect",
       caption = "Interaction: BMI * Age") +
  theme_minimal()

Answer: The p-value (Pr(>F)) is 0.1924. With a significance level (α) commonly set at 0.05, a p-value greater than α (0.05) suggests that we fail to reject the null hypothesis.Therefore, we do not have enough evidence to conclude that including the interaction term (BMI * MAGER) significantly improves the fit of the model compared to having only the main effects (BMI + MAGER)

Estimations of BMI: For a mother aged 20, the BMI effect estimate is around 1.3, for a mother aged 30 it is around 1.95 and for a women aged 40 it is around 2.6.

There seems to be a linear relationship between the mother’s age and the BMI effect. As the age of the mother increases, the BMI effect on the birth weight also increases. .