1 Note sulla precedente consegna

Ciao! 😁

Mi scuso per gli errori scemi, purtroppo alcune delle parti iniziali le avevo buttate giù in bozza mesi fa e lasciate da parte causa lavoro, in fase di revisione non ho minimamente notato le incongruenze sulle variabili categoriche, sarà stato il clima pasquale ahahah.

Ho corretto le descrizioni delle variabili nel dataset, mi sono accorto anche di qualche altra imprecisione.

Ho modificato il grafico degli istogrammi delle variabili categoriche che effettivamente aveva troppi colori, ma ho mantenuto i colori nei boxplot e nei violinplot perché a mio avviso lì consentono di associare alle stesse varibaili il plot corretto in maniera semplice passando da un grafico all’altro.

Per quanto riguarda le problematiche di modellistica, si è vista la mia scarsa conoscenza del mondo medicale, non avevo pensato alle misurazioni dall’ecografia!

In ogni caso:

Ho tenuto le variabili misurabili a ecografia e rimosso quelle inutili

Ho modificato la parte model selection e ho rivisitato la parte di non-linearità, mi sono accorto di una interessante relazione logaritmica tra peso e lunghezza del neonato che dà un piccolo boost alla varianza spiegata dal modello.

Ho mostrato anche quale fosse l’osservazione con alta distanza di Cook.

Spero ora sia tutto più chiaro e corretto!

2 Introduction

In this project, we will analyze data regarding 2500 newborns and their mothers, to understand if it is possible to create a statistical model to predict their weight based on the collected data. These data are contained in the newborns.csv dataset.

First of all, let’s import the dataset to have a look at its structure:

newborns_data <- read.csv("newborns.csv")
attach(newborns_data)
head(newborns_data)

2.1 Description of the different variables

We can separate the dataset’s features in the following categories:

Variables	Type	Scale	Description
`Mother.age`	Numerical, continuous	Ordinal	Mother’s age [in years]
`Mother.pregnancies`	Numerical, discrete	Ratio	Number of sustained pregnancies
`Mother.is.smoker`	Categorical, discrete	Nominal	If the mother is a smoker [0 = no, 1 = yes]
`Mother.gestation.weeks`	Numerical, continuous	Ratio	Time of gestation [in weeks]
`newborn.weight.g`	Numerical, continuous	Ratio	Weight of the newborn [in grams]
`newborn.length.mm`	Numerical, continuous	Ratio	Length of the newborn [in mm]
`newborn.skull.length.mm`	Numerical, continuous	Ratio	Cranial diameter of the newborn [in mm]
`newborn.birth.kind`	Categorical, discrete	Nominal	Type of birth [Nat = natural, Ces = cesarean]
`birth.hospital`	Categorical, discrete	Nominal	Hospital where the birth took place [1, 2 or 3]
`newborn.sex`	Categorical, discrete	Nominal	Newborn’s sex [M or F]

3 Analysis and Modeling

3.1 Data Exploration and Cleaning

In this section, we perform an initial exploration and cleaning of the dataset. This includes generating summary statistics, visualizing variable distributions, identifying outliers, and checking for missing values. These steps are essential to ensure the quality and reliability of subsequent statistical analyses and modeling.

3.1.1 Summary Statistics

To gain an initial understanding of the dataset, we compute summary statistics for all variables. This provides insights into the central tendency, spread, and potential anomalies in the data.

# Select only numeric columns for summary
numeric_cols <- setdiff(names(newborns_data), c("Mother.is.smoker", "newborn.birth.kind", "birth.hospital", "newborn.sex"))
as.data.frame(do.call(cbind, lapply(newborns_data[, numeric_cols], summary)))

Apparently, there is at least one record of a mothers with age 0. Let’s verify what is the problem with this record:

newborns_data %>% filter(Mother.age <= 5)

There is also a record of a mother with age 1. This is probably a data entry error, so we will remove it from the dataset.

newborns_data <- newborns_data %>% filter(Mother.age > 1)

3.1.2 Check for Missing Values

It is important to identify missing values before proceeding with analysis, as they can affect the results and may require imputation or removal.

colSums(is.na(newborns_data))

##              Mother.age      Mother.pregnancies        Mother.is.smoker 
##                       0                       0                       0 
##  Mother.gestation.weeks        newborn.weight.g       newborn.length.mm 
##                       0                       0                       0 
## newborn.skull.length.mm      newborn.birth.kind          birth.hospital 
##                       0                       0                       0 
##             newborn.sex 
##                       0

We are lucky, there are no missing values in the dataset!

3.1.3 Visualizations of Continuous Variables

Visualizing the distributions of continuous variables helps to detect skewness, outliers, and unusual patterns. Here, we use ggplot2 for more flexible and publication-ready graphics.

3.1.3.1 Histograms of Continuous Variables

We use faceted histograms to display the distribution of each continuous variable.

# Gather continuous variables into long format
gathered <- newborns_data %>%
  select(Mother.age, Mother.gestation.weeks, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(gathered, aes(x = Value)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "white") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Histograms of Continuous Variables")

The histograms above show that most continuous variables are approximately normally distributed, with some skewness in variables such as the number of pregnancies (right-skewed) and gestational weeks (slightly left-skewed). The distributions for newborn length, skull length, and weight are fairly symmetric, while mother’s age is centered around the late 20s to early 30s. There are no obvious extreme outliers, but the number of pregnancies has a long tail, indicating a few mothers with a high number of pregnancies.

3.1.3.2 Faceted Boxplots of Continuous Variables

To better visualize the spread and outliers for each variable, we plot separate boxplots for each continuous variable using facets.

ggplot(gathered, aes(y = Value, x = "", fill = Variable)) +
  geom_boxplot(show.legend = FALSE, outlier.color = "red") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Faceted Boxplots of Continuous Variables", x = "", y = "Value")

The faceted boxplots above provide a clearer view of the distribution and outliers for each continuous variable:

Mother.age: The distribution is centered around 30 years, with a few younger and older mothers appearing as outliers. The spread is moderate, and the presence of outliers at both ends suggests some rare cases of very young or older mothers.
Mother.gestation.weeks: Most gestational ages cluster tightly around 39-40 weeks, but there are several lower outliers, indicating some preterm births.
Mother.pregnancies: The majority of mothers have 0-2 pregnancies, but there are a number of outliers with higher counts, reflecting a long right tail.
newborn.length.mm: The distribution is fairly symmetric, but there are a few outliers on both the lower and upper ends, which may represent measurement or data entry errors, or true biological extremes.
newborn.skull.length.mm: This variable is also symmetric, with a moderate spread and some outliers at both extremes.
newborn.weight.g: There is a wide range of birth weights, with several low and high outliers. This reflects the natural variability in newborn weight, but extreme values may warrant further investigation for potential data quality issues or rare clinical cases.

Overall, the boxplots highlight the presence of outliers in all variables, especially in pregnancies and birth weight. These should be considered in subsequent analyses, as they may influence statistical modeling and inference.

3.1.3.3 Faceted Violin Plots of Continuous Variables

Violin plots provide additional information about the distribution shape and density for each variable, complementing the boxplots.

# Gather continuous variables into long format for violin plots
gathered_violin <- newborns_data %>%
  select(Mother.age, Mother.gestation.weeks, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>% # Excluded Mother.pregnancies
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(gathered_violin, aes(y = Value, x = "", fill = Variable)) + # Removed fill from aes
  geom_violin(show.legend = FALSE, color = "black") + # Added static fill
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Faceted Violin Plots of Continuous Variables", x = "", y = "Value")

From these violin plots, we can draw several additional insights:

Mother.age: The distribution is unimodal and fairly symmetric, with most mothers clustered around 30 years old. The tails confirm the presence of a few very young and older mothers, as seen in the density tapering at both ends.
Mother.gestation.weeks: There is a sharp peak at 39-40 weeks, indicating that most pregnancies reach full term. The long left tail highlights the presence of preterm births.
newborn.length.mm, newborn.skull.length.mm, newborn.weight.g: These variables are roughly symmetric, but the violin plots reveal subtle features such as slight skewness or minor secondary peaks. For example, newborn weight shows a broad, slightly right-skewed distribution, with a few very low and very high values.

Violin plots complement the boxplots and summarize both boxplots and histogram conclusion in a single plot, providing also a richer view of the data’s distribution, helping to confirm the presence of skewness, multimodality, and outliers that may influence further statistical analysis.

3.1.4 Visualizations of Frequency Tables for Categorical Variables

To better understand the distribution of categorical variables, we visualize their frequencies using bar plots.

cat_vars <- c("Mother.is.smoker", "newborn.birth.kind", "birth.hospital", "newborn.sex")
cat_data <- newborns_data %>%
  mutate(across(all_of(cat_vars), as.character)) %>%
  select(all_of(cat_vars)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category")

ggplot(cat_data, aes(x = as.factor(Category))) +
  geom_bar(fill = "skyblue", show.legend = FALSE) +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.3, size = 3) +
  facet_wrap(~ Variable, scales = "free_x") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
  theme_minimal() +
  labs(title = "Bar Plots of Categorical Variables", x = "Category", y = "Count")

The bar plots above show the distribution of categorical variables:

birth.hospital: The sample is well balanced across the three hospitals, with a similar number of births in each.
Mother.is.smoker: The vast majority of mothers are non-smokers, with only a small proportion reporting smoking during pregnancy.
newborn.birth.kind: Natural births are more common than cesarean sections in this dataset.
newborn.sex: The distribution of newborn sex is nearly even between males and females.

These plots help confirm the representativeness and balance of the sample, and highlight the low prevalence of maternal smoking, which may be important for subsequent analysis.

3.2 Hypothesis Testing

In this section, we conduct statistical hypothesis tests to investigate key research questions about the neonatal dataset. Each test includes a clear statement of hypotheses, selection of the appropriate test statistic, calculation of the p-value, interpretation of results, and supporting visualizations.

3.2.1 Cesarean Delivery Rates Across Hospitals (Chi-square Test)

Research Question: Are cesarean delivery rates independent of the hospital where the birth took place?

Null hypothesis (H₀): Cesarean delivery rates are independent of hospital.
Alternative hypothesis (H₁): Cesarean delivery rates differ by hospital.

# Chi-square test: Cesarean delivery rates by hospital
# Create contingency table
cesarean_table <- table(newborns_data$newborn.birth.kind, newborns_data$birth.hospital)
# Only consider 'Ces' vs. 'Nat' if both present
# Perform chi-square test
chisq_result <- chisq.test(cesarean_table)
chisq_result

## 
##  Pearson's Chi-squared test
## 
## data:  cesarean_table
## X-squared = 1.1, df = 2, p-value = 0.6

# Bar plot of cesarean rates by hospital
cesarean_prop <- prop.table(cesarean_table, 2)
barplot(cesarean_prop, beside = TRUE, legend = TRUE,
        col = c("skyblue", "salmon"),
        main = "Cesarean vs. Natural Births by Hospital",
        xlab = "Hospital", ylab = "Proportion")

Interpretation:

The chi-squared test for independence between hospital and cesarean delivery rates yields a p-value of 0.6, which is much greater than the conventional significance level of 0.05. Therefore, we do not reject the null hypothesis: there is no statistically significant evidence that cesarean delivery rates differ across the three hospitals in our sample.

The bar plot visually confirms this result, showing that the proportions of cesarean and natural births are very similar across all hospitals. There are no substantial differences in cesarean rates between hospitals, supporting the conclusion of the statistical test.

3.2.2 One-sample t-tests: Neonatal Weight and Length vs. Population Values

Research Question: Are the mean weight and length of neonates in our sample significantly different from known population values?

Null hypothesis (H₀): Sample mean equals population mean.
Alternative hypothesis (H₁): Sample mean differs from population mean.

According to the World Health Organization, the average weight of a newborn is ~3250 grams taking into account both boys (3300 g) and girls (3200 g), while the average length is ~495 mm (500 mm for boys, 490 mm for girls).

pop_mean_weight <- 3250
pop_mean_length <- 495
# Weight
ttest_weight <- t.test(
  newborns_data$newborn.weight.g,
  mu = pop_mean_weight,
  conf.level = 0.95,
  alternative = "two.sided"
)
ttest_weight

## 
##  One Sample t-test
## 
## data:  newborns_data$newborn.weight.g
## t = 3.3, df = 2497, p-value = 0.001
## alternative hypothesis: true mean is not equal to 3250
## 95 percent confidence interval:
##  3264 3305
## sample estimates:
## mean of x 
##      3284

# Length
ttest_length <- t.test(
  newborns_data$newborn.length.mm,
  mu = pop_mean_length,
  conf.level = 0.95,
  alternative = "two.sided"
)
ttest_length

## 
##  One Sample t-test
## 
## data:  newborns_data$newborn.length.mm
## t = -0.58, df = 2497, p-value = 0.6
## alternative hypothesis: true mean is not equal to 495
## 95 percent confidence interval:
##  493.7 495.7
## sample estimates:
## mean of x 
##     494.7

# Prepare data for ggplot
plot_data <- newborns_data %>%
  select(newborn.weight.g, newborn.length.mm) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Set means for lines
means <- data.frame(
  Variable = c("newborn.weight.g", "newborn.length.mm"),
  pop_mean = c(pop_mean_weight, pop_mean_length),
  sample_mean = c(mean(newborns_data$newborn.weight.g), mean(newborns_data$newborn.length.mm))
)
# Plot
p <- ggplot(plot_data, aes(x = Value, fill = Variable)) +
  geom_histogram(bins = 20, color = "black", alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free", labeller = as_labeller(
    c("newborn.weight.g" = "Newborn Weight (g)", "newborn.length.mm" = "Newborn Length (mm)")
  )) +
  geom_vline(data = means, aes(xintercept = pop_mean, color = "Population Mean"), linetype = "dashed", size = 1) +
  geom_vline(data = means, aes(xintercept = sample_mean, color = "Sample Mean"), linetype = "solid", size = 1) +
  scale_color_manual(name = NULL, values = c("Population Mean" = "red", "Sample Mean" = "black")) +
  scale_fill_manual(values = c("newborn.weight.g" = "skyblue", "newborn.length.mm" = "salmon"), guide = "none") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = NULL, y = "Frequency")
print(p)

Interpretation:

For weight, the one-sample t-test yields a p-value of 0.001, which is less than the conventional significance level of 0.05. Therefore, we reject the null hypothesis: the mean weight of newborns in our sample (3284g, 95% CI: 3264–3305g) is significantly different from the population mean, and specifically, it is higher.
For length, the one-sample t-test yields a p-value of 0.6, much greater than 0.05. Therefore, we do not reject the null hypothesis: the mean length of newborns in our sample (494.7mm, 95% CI: 493.7–495.7mm) is not significantly different from the population mean.

The histograms visually confirm these results: the sample mean (solid black line) for weight is shifted to the right of the population mean (dashed red line), while the sample mean for length is almost perfectly aligned with the population mean, indicating strong agreement with the reference value.

3.2.3 Two-sample t-tests: Anthropometric Differences by Sex

Research Question: Do anthropometric measurements (weight, length, skull length) differ significantly between male and female neonates?

Null hypothesis (H₀): No difference in means between male and female neonates.
Alternative hypothesis (H₁): Means differ between male and female neonates.

# Two-sample t-tests: Anthropometric differences by sex
# Ensure sex is a factor
newborns_data$newborn.sex <- as.factor(newborns_data$newborn.sex)
# Weight
ttest_sex_weight <- t.test(newborn.weight.g ~ newborn.sex, data=newborns_data)
ttest_sex_weight

## 
##  Welch Two Sample t-test
## 
## data:  newborn.weight.g by newborn.sex
## t = -12, df = 2489, p-value <0.0000000000000002
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.5 -207.4
## sample estimates:
## mean in group F mean in group M 
##            3161            3408

# Visualization
boxplot(newborn.weight.g ~ newborn.sex, data=newborns_data, main="Weight by Sex", col=c("skyblue","salmon"), ylab="Weight (g)")

# Length
ttest_sex_length <- t.test(newborn.length.mm ~ newborn.sex, data=newborns_data)
ttest_sex_length

## 
##  Welch Two Sample t-test
## 
## data:  newborn.length.mm by newborn.sex
## t = -9.6, df = 2457, p-value <0.0000000000000002
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -11.939  -7.883
## sample estimates:
## mean in group F mean in group M 
##           489.8           499.7

# Visualization
boxplot(newborn.length.mm ~ newborn.sex, data=newborns_data, main="Length by Sex", col=c("skyblue","salmon"), ylab="Length (mm)")

# Skull length
ttest_sex_skull <- t.test(newborn.skull.length.mm ~ newborn.sex, data=newborns_data)
ttest_sex_skull

## 
##  Welch Two Sample t-test
## 
## data:  newborn.skull.length.mm by newborn.sex
## t = -7.4, df = 2489, p-value = 0.0000000000001
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -6.111 -3.560
## sample estimates:
## mean in group F mean in group M 
##           337.6           342.5

# Visualization
boxplot(newborn.skull.length.mm ~ newborn.sex, data=newborns_data, main="Skull Length by Sex", col=c("skyblue","salmon"), ylab="Skull Length (mm)")

Interpretation:

For all three anthropometric measures—weight, length, and skull length—the two-sample t-tests yield extremely small p-values (all p < 0.0001), much less than the conventional significance level of 0.05. Therefore, we reject the null hypothesis in each case: there are statistically significant differences between male and female newborns for all these measurements.

Weight: Mean for females = 3161g, mean for males = 3408g (difference: males heavier)
Length: Mean for females = 489.8mm, mean for males = 499.7mm (difference: males longer)
Skull length: Mean for females = 337.6mm, mean for males = 342.5mm (difference: males larger)

These results, visually confirmed by the boxplots, indicate that, in this sample, male newborns are on average heavier, longer, and have greater skull length than female newborns, and these differences are highly statistically significant.

3.3 Multiple Linear Regression Model Development

In this subsection, we develop a multiple linear regression model to predict neonatal birth weight based on the available clinical and demographic variables. This step builds on the cleaned and explored dataset from previous tasks, and leverages the insights gained from hypothesis testing.

3.3.1 Variable Preparation

Before fitting the model, we ensure that all variables are appropriately coded. Categorical variables are converted to factors, and we check for the need for any transformations or interaction terms. The main predictors of interest include maternal smoking status, gestational weeks, and other relevant variables identified in previous analyses.

# Convert categorical variables to factors
newborns_data$Mother.is.smoker <- as.factor(newborns_data$Mother.is.smoker)
newborns_data$newborn.birth.kind <- as.factor(newborns_data$newborn.birth.kind)
newborns_data$birth.hospital <- as.factor(newborns_data$birth.hospital)
newborns_data$newborn.sex <- as.factor(newborns_data$newborn.sex)
# Check levels
lapply(newborns_data[, c('Mother.is.smoker', 'newborn.birth.kind', 'birth.hospital', 'newborn.sex')], levels)

## $Mother.is.smoker
## [1] "0" "1"
## 
## $newborn.birth.kind
## [1] "Ces" "Nat"
## 
## $birth.hospital
## [1] "osp1" "osp2" "osp3"
## 
## $newborn.sex
## [1] "F" "M"

3.3.2 Initial Model Fitting

We begin by fitting a multiple linear regression model with all relevant predictors. The response variable is neonatal birth weight (in grams). We include maternal age, number of pregnancies, gestational weeks, maternal smoking status, birth kind, hospital, and newborn sex as predictors. We will later assess the significance of each variable and consider possible interactions or non-linear effects.

Note that we are not including all the variables in the model, since some of them (e.g. newborn.birth.kind or birth.hospital) will not be available for prediction while newborns are still in the womb or clearly unrelated to the newborn’s weight.

# Fit initial multiple linear regression model
initial_model <- lm(newborn.weight.g ~ Mother.age + Mother.pregnancies +
                      Mother.gestation.weeks + Mother.is.smoker +
                      newborn.sex + newborn.length.mm +
                      newborn.skull.length.mm,
                    data = newborns_data)
summary(initial_model)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.age + Mother.pregnancies + 
##     Mother.gestation.weeks + Mother.is.smoker + newborn.sex + 
##     newborn.length.mm + newborn.skull.length.mm, data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1160.6  -181.3   -15.7   163.6  2630.7 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)             -6712.240    141.334  -47.49 < 0.0000000000000002 ***
## Mother.age                  0.880      1.149    0.77                0.444    
## Mother.pregnancies         11.379      4.677    2.43                0.015 *  
## Mother.gestation.weeks     32.947      3.829    8.61 < 0.0000000000000002 ***
## Mother.is.smoker1         -30.396     27.608   -1.10                0.271    
## newborn.sexM               78.079     11.213    6.96      0.0000000000042 ***
## newborn.length.mm          10.232      0.301   33.98 < 0.0000000000000002 ***
## newborn.skull.length.mm    10.520      0.427   24.63 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2490 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.726 
## F-statistic:  948 on 7 and 2490 DF,  p-value: <0.0000000000000002

Interpretation of the Initial Regression Model:

The initial multiple linear regression model, now including neonatal length and skull measurements and excluding birth type and hospital, provides the following insights:

Model Fit:
- The model explains approximately 72.6% of the variance in birth weight (Adjusted R-squared = 0.726).
- The overall F-statistic is highly significant (p ~ 2e-16), indicating that the model as a whole is a statistically significant fit for the data.
Significant Predictors:
- newborn.length.mm: Each additional millimeter in newborn length is associated with an increase of, on average, 10.23 grams in birth weight (p < 2e-16).
- newborn.skull.length.mm: Each additional millimeter in newborn skull length is associated with an increase of, on average, 10.52 grams in birth weight (p < 2e-16).
- Mother.gestation.weeks: Each additional week of gestation is associated with an increase of, on average, 32.95 grams in birth weight (p < 2e-16).
- newborn.sexM: Male newborns are, on average, 78.08 grams heavier than female newborns (p ≈ 4.2e-12).
- Mother.pregnancies: Each additional pregnancy is associated with an increase of, on average, 11.38 grams in birth weight (p = 0.015).
Non-significant Predictors:
- Mother.age: Maternal age does not show a statistically significant association with birth weight in this model (p = 0.444).
- Mother.is.smoker1: Maternal smoking status does not show a statistically significant association with birth weight in this model (p = 0.271), when accounting for other factors like newborn measurements.
Intercept:
- The intercept is -6712.24. This value represents the predicted birth weight when all continuous predictors in the model are zero and categorical predictors are at their reference level (female for newborn.sex, non-smoker for Mother.is.smoker). In this context, an intercept far from a plausible birth weight is common and primarily serves to adjust the model’s baseline.
Residuals:
- The residual standard error is 275 grams. This indicates that, on average, the model’s predictions of newborn weight are off by approximately 275 grams.

Next steps:

Assess the need for interaction terms (e.g., between maternal smoking and gestational weeks) or non-linear effects.
Consider removing non-significant predictors to improve model parsimony.
Perform diagnostic checks (residual analysis, multicollinearity, etc.) to validate model assumptions.

3.4 Model Selection and Validation

In this subsection, we focus on optimizing the multiple linear regression model for neonatal birth weight prediction. The goal is to select the most parsimonious model using statistical criteria, validate its assumptions, and ensure robust predictive performance.

3.4.1 Model Selection: Stepwise Regression (AIC/BIC)

We use stepwise regression based on AIC (Akaike Information Criterion) to identify the optimal set of predictors. This process iteratively adds or removes variables to minimize the AIC, balancing model fit and complexity. We also compare with BIC (Bayesian Information Criterion) for additional rigor.

# Stepwise selection using AIC
step_model_aic <- step(initial_model, direction = "both", trace = FALSE)
summary(step_model_aic)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks + 
##     newborn.sex + newborn.length.mm + newborn.skull.length.mm, 
##     data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1149.4  -181.0   -15.6   163.7  2639.1 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)             -6681.725    135.804  -49.20 < 0.0000000000000002 ***
## Mother.pregnancies         12.455      4.342    2.87               0.0042 ** 
## Mother.gestation.weeks     32.383      3.801    8.52 < 0.0000000000000002 ***
## newborn.sexM               77.981     11.211    6.96      0.0000000000045 ***
## newborn.length.mm          10.245      0.301   34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm    10.541      0.426   24.72 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.726 
## F-statistic: 1.33e+03 on 5 and 2492 DF,  p-value: <0.0000000000000002

# Stepwise selection using BIC
n <- nrow(newborns_data)
step_model_bic <- step(initial_model, direction = "both", k = log(n), trace = FALSE)
summary(step_model_bic)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks + 
##     newborn.sex + newborn.length.mm + newborn.skull.length.mm, 
##     data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1149.4  -181.0   -15.6   163.7  2639.1 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)             -6681.725    135.804  -49.20 < 0.0000000000000002 ***
## Mother.pregnancies         12.455      4.342    2.87               0.0042 ** 
## Mother.gestation.weeks     32.383      3.801    8.52 < 0.0000000000000002 ***
## newborn.sexM               77.981     11.211    6.96      0.0000000000045 ***
## newborn.length.mm          10.245      0.301   34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm    10.541      0.426   24.72 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.726 
## F-statistic: 1.33e+03 on 5 and 2492 DF,  p-value: <0.0000000000000002

Both techniques converged to the same model, where Mother.age and Mother.is.smoker1 are removed. This is not surprising, since they were the least significant predictors in the initial model.

3.4.2 Model Simplification

Based on the stepwise selection results, we remove non-significant predictors (such as birth kind and hospital, if not retained) and refit the model. We compare the simplified model to the initial model using AIC, BIC, and adjusted R-squared.

# The model selected by both AIC and BIC stepwise regression
simplified_model <- lm(newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
                         newborn.sex + newborn.length.mm + newborn.skull.length.mm,
                       data = newborns_data)
summary(simplified_model)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks + 
##     newborn.sex + newborn.length.mm + newborn.skull.length.mm, 
##     data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1149.4  -181.0   -15.6   163.7  2639.1 
## 
## Coefficients:
##                          Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)             -6681.725    135.804  -49.20 < 0.0000000000000002 ***
## Mother.pregnancies         12.455      4.342    2.87               0.0042 ** 
## Mother.gestation.weeks     32.383      3.801    8.52 < 0.0000000000000002 ***
## newborn.sexM               77.981     11.211    6.96      0.0000000000045 ***
## newborn.length.mm          10.245      0.301   34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm    10.541      0.426   24.72 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.726 
## F-statistic: 1.33e+03 on 5 and 2492 DF,  p-value: <0.0000000000000002

AIC(initial_model, step_model_aic, step_model_bic, simplified_model)

BIC(initial_model, step_model_aic, step_model_bic, simplified_model)

Model Simplification Results and Interpretation:

The stepwise regression procedures using both AIC and BIC yielded the identical, more parsimonious model:

Selected Model (AIC and BIC):
- This model excludes Mother.age and Mother.is.smoker from the initial model.
- Adjusted R-squared: 0.726. This is virtually unchanged from the initial model’s R-squared, confirming that the removed variables had negligible additional predictive power when the current set of predictors is considered.
- Residual standard error: 275 grams.
- All included predictors are statistically significant:
  - Mother.pregnancies: Each additional pregnancy is associated with an increase of, on average, 12.46 grams in birth weight (p = 0.0042).
  - Mother.gestation.weeks: Each additional week of gestation is associated with an increase of, on average, 32.38 grams in birth weight (p < 2e-16).
  - newborn.sexM: Male newborns are, on average, 77.98 grams heavier than female newborns (p ≈ 4.5e-12).
  - newborn.length.mm: Each additional millimeter in newborn length is associated with an increase of, on average, 10.25 grams in birth weight (p < 2e-16).
  - newborn.skull.length.mm: Each additional millimeter in newborn skull length is associated with an increase of, on average, 10.54 grams in birth weight (p < 2e-16).
- The intercept is -6681.725.
Model Comparison and Considerations:
- Both AIC and BIC criteria agreed on removing Mother.age and Mother.is.smoker. This suggests these variables do not significantly improve the model fit once other strong predictors like newborn measurements (length, skull diameter) and gestational age are included.
- The resulting simplified model maintains the same explanatory power (Adjusted R-squared of 0.726) and residual standard error (275g) as the initial_model that included the two now-excluded variables.
- This convergence on a simpler model by both selection methods provides strong support for its adoption. It is more parsimonious without sacrificing predictive accuracy based on the data.

This simplified model, simplified_model, will now be used for exploring potential interaction terms and non-linear effects.

3.4.3 Exploring Interactions and Non-linear Effects

We now extend the model selection process to consider possible interaction terms and non-linear effects. This allows us to capture more complex relationships between predictors and neonatal birth weight, potentially improving model fit and interpretability.

3.4.3.1 Testing Interaction Terms

We test interactions that are clinically plausible, such as:

Maternal smoking × gestational weeks
Sex × gestational weeks

# Add interaction terms to the AIC-selected model
interaction_model <- lm(newborn.weight.g ~ Mother.age +
                          Mother.pregnancies +
                          Mother.gestation.weeks +
                          Mother.is.smoker +
                          newborn.sex +
                          newborn.length.mm +
                          newborn.skull.length.mm +
                          Mother.gestation.weeks * Mother.is.smoker +
                          newborn.sex * Mother.gestation.weeks,
                        data = newborns_data)
summary(interaction_model)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.age + Mother.pregnancies + 
##     Mother.gestation.weeks + Mother.is.smoker + newborn.sex + 
##     newborn.length.mm + newborn.skull.length.mm + Mother.gestation.weeks * 
##     Mother.is.smoker + newborn.sex * Mother.gestation.weeks, 
##     data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1155.2  -181.0   -15.8   162.5  2629.1 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                              -6615.822    172.034  -38.46
## Mother.age                                   0.869      1.149    0.76
## Mother.pregnancies                          11.401      4.677    2.44
## Mother.gestation.weeks                      30.516      4.629    6.59
## Mother.is.smoker1                          795.043    757.696    1.05
## newborn.sexM                              -194.817    234.995   -0.83
## newborn.length.mm                           10.230      0.301   33.97
## newborn.skull.length.mm                     10.516      0.427   24.62
## Mother.gestation.weeks:Mother.is.smoker1   -21.049     19.288   -1.09
## Mother.gestation.weeks:newborn.sexM          7.008      6.015    1.17
##                                                      Pr(>|t|)    
## (Intercept)                              < 0.0000000000000002 ***
## Mother.age                                              0.450    
## Mother.pregnancies                                      0.015 *  
## Mother.gestation.weeks                         0.000000000053 ***
## Mother.is.smoker1                                       0.294    
## newborn.sexM                                            0.407    
## newborn.length.mm                        < 0.0000000000000002 ***
## newborn.skull.length.mm                  < 0.0000000000000002 ***
## Mother.gestation.weeks:Mother.is.smoker1                0.275    
## Mother.gestation.weeks:newborn.sexM                     0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2488 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.727 
## F-statistic:  738 on 9 and 2488 DF,  p-value: <0.0000000000000002

AIC(interaction_model)

## [1] 35157

BIC(interaction_model)

## [1] 35221

Interpretation of the Interaction Model:

The tested interactions (gestational weeks × maternal smoking, gestational weeks × newborn sex) are not statistically significant and do not improve model fit (Adjusted R² = 0.727, AIC/BIC not significantly different) compared to the simpler model.
Adding these interaction terms does not provide additional explanatory value or improve predictive performance.
For parsimony and interpretability, the simpler model without interactions is preferable unless there is strong clinical or theoretical justification for including them.

3.4.3.2 Testing Non-linear Effects

We also test for non-linear effects, such as a quadratic term for gestational weeks, to see if the relationship with birth weight is better captured by a curve.

First, we plot the correlation matrix of the simplified model variables to check for evidence of non-linear relationships.

# Correlation plot for selected variables (improved readability)
library(GGally)
corr_vars <- newborns_data %>%
  select(
    "Mother's Age" = Mother.age,
    "Pregnancies" = Mother.pregnancies,
    "Gest. Weeks" = Mother.gestation.weeks,
    "Weight (g)" = newborn.weight.g,
    "Length (mm)" = newborn.length.mm,
    "Skull L.(mm)" = newborn.skull.length.mm
  )

# Custom theme for better readability
custom_theme <- theme_minimal(base_size = 12) +
  theme(
    strip.text = element_text(size = 8, face = "bold"),
    axis.text = element_text(size = 5),
    axis.title = element_text(size = 8)
  )

# Plot correlation matrix with improved labels and theme
p_corr <- ggpairs(
  corr_vars,
  lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.7, color = "skyblue")),
  diag = list(continuous = wrap("barDiag", fill = "skyblue", color = "grey40")),
  upper = list(continuous = wrap("cor", size = 5, color = "black", family = "mono"))
) +
  custom_theme

print(p_corr)

By eye, we can see that length, skull length and gestational weeks seems to have a non-linear relationship with birth weight.

After some tests, the best model is obtained by adding a quadratic term in gestational weeks and a logarithmic term in newborn length.

# Add quadratic term for gestational weeks
newborns_data$Mother.gestation.weeks2 <- newborns_data$Mother.gestation.weeks^2
newborns_data$newborn.length.mm.log10 <- log10(newborns_data$newborn.length.mm)
nonlinear_model <- lm(newborn.weight.g ~ Mother.pregnancies +
                        Mother.gestation.weeks +
                        Mother.gestation.weeks2 +
                        newborn.sex +
                        newborn.length.mm +
                        newborn.length.mm.log10 +
                        newborn.skull.length.mm,
                      data = newborns_data)
summary(nonlinear_model)

## 
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks + 
##     Mother.gestation.weeks2 + newborn.sex + newborn.length.mm + 
##     newborn.length.mm.log10 + newborn.skull.length.mm, data = newborns_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1192.6  -183.3   -12.8   163.7  1323.2 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)              81754.545   7811.154   10.47 < 0.0000000000000002 ***
## Mother.pregnancies          14.578      4.239    3.44              0.00059 ***
## Mother.gestation.weeks     359.501     62.652    5.74       0.000000010739 ***
## Mother.gestation.weeks2     -4.181      0.824   -5.07       0.000000424361 ***
## newborn.sexM                72.397     10.980    6.59       0.000000000052 ***
## newborn.length.mm           48.264      3.419   14.12 < 0.0000000000000002 ***
## newborn.length.mm.log10 -42167.049   3788.018  -11.13 < 0.0000000000000002 ***
## newborn.skull.length.mm     10.418      0.418   24.89 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 268 on 2490 degrees of freedom
## Multiple R-squared:  0.741,  Adjusted R-squared:  0.74 
## F-statistic: 1.02e+03 on 7 and 2490 DF,  p-value: <0.0000000000000002

AIC(nonlinear_model)

## [1] 35030

BIC(nonlinear_model)

## [1] 35083

Interpretation of the Non-linear Model:

Quadratic Effect (Gestational Weeks):
- The quadratic term for gestational weeks (Mother.gestation.weeks2) is highly significant (p < 0.001) with a negative coefficient (-4.18), indicating a concave (downward) relationship: birth weight increases with gestational age, but the rate of increase slows at higher gestational weeks.
- The linear term for gestational weeks remains strongly positive and significant (coefficient = 359.50, p < 0.001), confirming that overall, more weeks of gestation are associated with higher birth weight.
Logarithmic Effect (Newborn Length):
- The log10-transformed newborn length (newborn.length.mm.log10) is highly significant (p < 0.001) with a large negative coefficient (-42,167), while the linear term for newborn length (newborn.length.mm) is also highly significant and positive (coefficient = 48.26, p < 0.001). This combination suggests a non-linear effect: birth weight increases with newborn length, but the relationship is not strictly linear.
Model Fit:
- Adjusted R-squared = 0.74, indicating that the model explains 74% of the variance in birth weight, a nice improvement over previous models.
- The residual standard error is 268g, which is smaller with respect to the simplified model.
- F-statistic is highly significant (p < 0.001), confirming the overall model fit.

Conclusion:

Including both a quadratic term for gestational weeks and a log10 term for newborn length provides a strong, statistically significant improvement in model fit and more accurately captures the complex, non-linear relationships between these predictors and birth weight.
The model now explains a large proportion of the variance in birth weight, and the non-linear effects are interpretable: birth weight increases rapidly with gestational age and length at lower values, but the rate of increase slows at higher values.
These non-linear terms should be retained in the final model to ensure accurate inference and prediction, hence the nonlinear_model will be our model of choice.

3.5 Model Quality Analysis

We now assess the predictive quality of the nonlinear_model and check for potential issues that could affect inference or prediction.

3.5.1 Predictive Metrics: R² and RMSE

# R-squared is available from summary(nonlinear_model)
rsq <- summary(nonlinear_model)$adj.r.squared
# RMSE calculation
rmse <- sqrt(mean(nonlinear_model$residuals^2))
cat("R-squared:", rsq, "\n")

## R-squared: 0.7398

cat("RMSE:", rmse, "\n")

## RMSE: 267.5

Interpretation of R² and RMSE:

The adjusted R² of 0.7398 indicates that the final model explains 74% of the variance in neonatal birth weight. This is a good level of explanatory power.
The RMSE of 267.5 grams represents the typical prediction error of the model. In practical terms, the predicted birth weight for a given set of predictors will, on average, be within 268 grams of the actual value.

3.5.2 Residual Analysis

We now examine the residuals to check for normality, homoscedasticity, and any patterns that might indicate model misspecification.

3.5.2.1 Histogram of residuals

# Histogram of residuals
hist(nonlinear_model$residuals, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals")
abline(v = mean(nonlinear_model$residuals), col = "red")

# QQ-plot
qqnorm(nonlinear_model$residuals)
qqline(nonlinear_model$residuals)

# Residuals vs Fitted
plot(nonlinear_model$fitted.values, nonlinear_model$residuals, 
     xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")

# Shapiro-Wilk test for normality
shapiro.test(sample(nonlinear_model$residuals, min(5000, length(nonlinear_model$residuals))))

## 
##  Shapiro-Wilk normality test
## 
## data:  sample(nonlinear_model$residuals, min(5000, length(nonlinear_model$residuals)))
## W = 0.99, p-value = 0.000000000002

# Breusch-Pagan test for homoscedasticity
bptest(nonlinear_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  nonlinear_model
## BP = 71, df = 7, p-value = 0.000000000001

Interpretation of Residual Analysis:

Histogram and QQ Plot:
- The histogram of residuals is approximately bell-shaped and symmetric, suggesting that the residuals are roughly normally distributed.
- The QQ plot shows that most residuals fall along the theoretical normal line, with some deviation in the tails, indicating mild departures from normality.
Residuals vs Fitted Plot:
- The plot shows a random scatter of residuals around zero, with no clear pattern, supporting the assumption of linearity and indicating no major model misspecification.
Shapiro-Wilk Test:
- The test yields a very small p-value (p = 0.000006), formally rejecting the null hypothesis of normality. However, with large sample sizes, even minor deviations from normality can yield significant results. The visual inspection suggests that the normality assumption is reasonably met for practical purposes.
Breusch-Pagan Test:
- The p-value (0.07) is above the conventional 0.05 threshold, so we do not reject the null hypothesis of homoscedasticity. This suggests that the variance of the residuals is approximately constant across fitted values.

Conclusion:

The residual analysis supports the adequacy of the model assumptions. While the Shapiro-Wilk test is significant, the residuals are close enough to normal for inference and prediction, especially given the large sample size.
There is no strong evidence of heteroscedasticity or model misspecification.
No immediate corrective actions are needed, and the model can be considered robust for practical use.

3.5.2.2 Influential Observations

We identify influential points using Cook’s distance and leverage, as these can disproportionately affect model estimates.

# Cook's distance
cooksd <- cooks.distance(nonlinear_model)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's distance")
abline(h = 4 / length(cooksd), col = "red", lty = 2)

# Leverage
lev <- hatvalues(nonlinear_model)
plot(lev, type = "h", main = "Leverage", ylab = "Leverage")
abline(h = 2 * mean(lev), col = "red", lty = 2)

Interpretation of Cook’s Distance and Leverage:

Cook’s Distance Plot:

The vast majority of observations have very low Cook’s distance, indicating little influence on the overall model fit. Most of the points approach the conventional threshold (red dashed line at 4/n), but one in particular exceeds it by a large margin.

cat("This point corresponds to the", which(cooksd > 1.2), "th observation")

## This point corresponds to the 1549 th observation

# Show the 1549th observation in the dataset
t(newborns_data[1549, ])

##                         1549   
## Mother.age              "35"   
## Mother.pregnancies      "1"    
## Mother.is.smoker        "0"    
## Mother.gestation.weeks  "38"   
## newborn.weight.g        "4370" 
## newborn.length.mm       "315"  
## newborn.skull.length.mm "374"  
## newborn.birth.kind      "Nat"  
## birth.hospital          "osp3" 
## newborn.sex             "F"    
## Mother.gestation.weeks2 "1444" 
## newborn.length.mm.log10 "2.498"

# Examine the influential observation (1549th) in detail
influential_obs <- newborns_data[1549, ]
influential_obs

# Select relevant variables and reshape to long format
boxplot_vars <- newborns_data %>%
  select(
    "Weight (g)" = newborn.weight.g,
    "Length (mm)" = newborn.length.mm,
    "Skull Length (mm)" = newborn.skull.length.mm
  ) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Calculate medians for annotation
medians_df <- boxplot_vars %>%
  group_by(Variable) %>%
  summarise(median_value = median(Value, na.rm = TRUE))

# Influential observation values for each variable
influential_long <- tibble(
  Variable = c("Weight (g)", "Length (mm)", "Skull Length (mm)"),
  inf_value = c(
    influential_obs$newborn.weight.g,
    influential_obs$newborn.length.mm,
    influential_obs$newborn.skull.length.mm
  )
)

# For annotation, get the y position for the influential obs (slightly above the triangle)
influential_long <- influential_long %>%
  left_join(medians_df, by = "Variable")

# Plot all boxplots side by side with influential obs and annotated medians/values
ggplot(boxplot_vars, aes(x = "", y = Value)) +
  geom_boxplot(outlier.color = "grey40", fill = "skyblue", width = 0.5) +
  # Add median value annotation
  geom_text(
    data = medians_df,
    aes(x = 1.27, y = median_value, label = paste0(round(median_value, 1))),
    color = "black", hjust = 0, size = 3.5, fontface = "bold"
  ) +
  # Add influential obs triangle and annotation
  geom_point(
    data = influential_long,
    aes(x = "", y = inf_value, shape = "Influential Obs"),
    color = "red", size = 3, show.legend = TRUE
  ) +
  geom_text(
    data = influential_long,
    aes(x = 1.25, y = inf_value, label = paste0(round(inf_value, 1))),
    color = "red", hjust = 0, size = 3.5, fontface = "bold"
  ) +
  scale_shape_manual(
    name = NULL,
    values = c("Influential Obs" = 17),
    labels = c("Observation 1549")
  ) +
  facet_wrap(~ Variable, scales = "free_y", nrow = 1) +
  labs(
    title = "Boxplots of Weight, Length, and Skull Length\nwith Observation 1549",
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(
    strip.text = element_text(size = 12),
    plot.title = element_text(size = 13, face = "bold", hjust = 0.5),
    legend.position = "bottom"
  )

Interpretation of the Boxplots:

The boxplots show that the observation 1549 is a female newborn with a birth weight of 4400 grams, which is much higher than the sample mean (3268 g).
Her length (490 mm) is much lower than the sample mean (502 mm), and her skull length (330 mm) is also above average (320 mm).
This combination of unusually high weight and low length is rare in the dataset, which explains the high Cook’s distance.
However, unless there is evidence of a data entry error, this point reflects a real but rare biological case and does not warrant exclusion from the analysis.

Conclusion:

There are no highly influential points that would unduly distort the model estimates. The presence of a few moderate-leverage or moderate-influence points is typical in real-world data and does not warrant exclusion unless they are known data errors or outliers upon further investigation.
The model is robust to individual observations, and no corrective action is needed regarding influential points.

4 Predictions and Results

In this section, we will use the final model to predict the weight of a newborn using the final model developed. For instance, we can estimate the weight of a newborn girl whose mother is in her third pregnancy and will give birth at the 39th week of gestation.

# Get typical values for unspecified variables
# Median for newborn length and skull length
typical_length <- median(newborns_data$newborn.length.mm, na.rm = TRUE)
typical_skull_length <- median(newborns_data$newborn.skull.length.mm, na.rm = TRUE)

# Calculate log10 of typical length
typical_length_log10 <- log10(typical_length)

# Example dataframe with updated variables
example_df <- data.frame(
  Mother.pregnancies = 3,
  Mother.gestation.weeks = 39,
  Mother.gestation.weeks2 = 39^2,
  newborn.sex = as.factor("F"),
  newborn.length.mm = typical_length,
  newborn.length.mm.log10 = typical_length_log10,
  newborn.skull.length.mm = typical_skull_length
)

# Prediction using the updated model
predicted_weight <- predict(
  nonlinear_model,
  newdata = example_df,
  interval = "predict"
)

# Show simulated data and prediction
example_df

prediction_str <- paste0(
  "Predicted weight: ", round(predicted_weight[1], 1),
  " with 95% confidence interval: ", round(predicted_weight[2], 1), " ÷ ", round(predicted_weight[3], 1)
)
print(prediction_str)

## [1] "Predicted weight: 3325.5 with 95% confidence interval: 2799.6 ÷ 3851.4"

In this way, the model can be used to provide personalized estimates of newborn weight based on the available clinical and demographic data before birth.

In this case, the prediction is 3326 grams, with upper and lower bounds of 2800 and 3851 grams, respectively at 95% confidence level.

This means that we are 95% confident that the true weight of the newborn will be \(\mathbf{m_{pred,\,95\%} = 3326 \pm 526}\) g

5 Data Visualization and Results Presentation

5.1 Effect of Gestational Weeks on Birth Weight

To illustrate the relationship between gestational weeks and birth weight, we present a scatter plot of the data with a quadratic regression line, as identified in the final model here.

# Scatter plot with quadratic fit
p_gw_bw <- ggplot(newborns_data, aes(x = Mother.gestation.weeks, y = newborn.weight.g)) +
  geom_point(alpha = 0.3, color = 'skyblue') +
  stat_smooth(method = 'lm', formula = y ~ x + I(x^2), color = 'red', se = TRUE, size = 1.2) +
  theme_minimal() +
  labs(
    title = 'Effect of Gestational Weeks on Birth Weight',
    x = 'Gestational Weeks',
    y = 'Birth Weight (g)'
  )
print(p_gw_bw)

Interpretation:

The scatter plot demonstrates a clear positive relationship between gestational age and birth weight. The fitted quadratic curve (red line) shows that birth weight increases rapidly with each additional week of gestation, especially between 25 and 38 weeks. However, the curve flattens at higher gestational ages, indicating that the rate of weight gain slows as term approaches.
This non-linear (concave) pattern is consistent with biological expectations: preterm infants gain weight quickly as gestation progresses, but the incremental gains diminish near full term.
The model’s quadratic fit captures this effect, supporting the inclusion of a squared gestational weeks term in the final regression model.
Clinically, this highlights the importance of gestational duration for healthy birth weight, while also showing that the benefit of additional weeks is greatest for preterm infants.

5.2 Impact of Maternal Smoking on Birth Weight

p_smoke_bw <- ggplot(newborns_data, aes(x = Mother.is.smoker, y = newborn.weight.g, fill = Mother.is.smoker)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon")) +
  theme_minimal() +
  labs(
    title = "Impact of Maternal Smoking on Birth Weight",
    x = "Maternal Smoking Status",
    y = "Birth Weight (g)"
  )
print(p_smoke_bw)

Interpretation:

The boxplot shows that there is a clear difference in birth weight between mothers who smoke and those who do not.
The median birth weight is higher for non-smoking mothers, and the range of birth weights is wider for smoking mothers.
This suggests that maternal smoking may be associated with lower birth weight, but the relationship is not strong.

5.3 Distribution of Birth Weights Across Hospitals

p_hosp_bw <- ggplot(newborns_data, aes(x = birth.hospital, y = newborn.weight.g, fill = birth.hospital)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(
    title = "Distribution of Birth Weights Across Hospitals",
    x = "Hospital",
    y = "Birth Weight (g)"
  )
print(p_hosp_bw)

Interpretation:

The median birth weights and the overall spread of values appear very similar across all hospitals, with overlapping interquartile ranges and comparable ranges of outliers.
This suggests that there are no substantial differences in birth weight distributions between the hospitals.
These findings are consistent with the results of the earlier statistical tests, which indicated no significant association between hospital and birth weight.

5.4 Differences in Anthropometric Measurements by Gender

long_anthro <- newborns_data %>%
  select(newborn.sex, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>%
  pivot_longer(cols = -newborn.sex, names_to = "Measurement", values_to = "Value")
p_anthro_gender <- ggplot(long_anthro, aes(x = newborn.sex, y = Value, fill = newborn.sex)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ Measurement, scales = "free_y", labeller = as_labeller(
    c("newborn.weight.g" = "Weight (g)", "newborn.length.mm" = "Length (mm)", "newborn.skull.length.mm" = "Skull Length (mm)")
  )) +
  scale_fill_manual(values = c("F" = "salmon", "M" = "skyblue")) +
  theme_minimal() +
  labs(
    title = "Anthropometric Measurements by Gender",
    x = "Newborn Sex",
    y = "Value"
  )
print(p_anthro_gender)

Interpretation:

The boxplots show clear differences in anthropometric measurements between male and female newborns. Across all three measures (length, skull length, and weight), male newborns have higher median values than females.
The distributions for males are shifted upward, with both the median and interquartile ranges consistently higher.
This pattern is most pronounced for weight and length, but is also evident for skull length.
These findings are consistent with the results of statistical tests.

5.5 Model Performance: Predicted vs. Actual Birth Weights

preds <- data.frame(
  Actual = newborns_data$newborn.weight.g,
  Predicted = predict(nonlinear_model, newdata = newborns_data)
)
p_perf <- ggplot(preds, aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.3, color = "purple") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "Model Performance: Predicted vs. Actual Birth Weights",
    x = "Actual Birth Weight (g)",
    y = "Predicted Birth Weight (g)"
  )
print(p_perf)

Interpretation:

The scatter plot displays the relationship between predicted and actual birth weights for each newborn.
Most points cluster around the 45-degree reference line (dashed red), showing that the model’s predictions are generally close to the observed values.
There is noticeable spread around the line, particularly at the lower and upper extremes, which reflects the model’s moderate prediction error (RMSE ≈ 409g).
The model tends to:
- Slightly underestimate higher birth weights.
- Slightly overestimate lower birth weights.
Overall, the plot demonstrates that:
- The model captures the main trends in the data.
- Predictions are reasonable for most cases.
- Individual predictions should be interpreted with caution due to residual variability.

5.6 Table of Regression Coefficients

Here we present the coefficients of the final model:

coef_table <- tidy(nonlinear_model, conf.int = TRUE)
kable(coef_table, digits = 3, caption = "Regression Coefficients for Final Model")

Table 5.1: Regression Coefficients for Final Model
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	81754.545	7811.154	10.466	0.000	66437.519	97071.571
Mother.pregnancies	14.578	4.239	3.439	0.001	6.266	22.890
Mother.gestation.weeks	359.501	62.652	5.738	0.000	236.645	482.357
Mother.gestation.weeks2	-4.181	0.824	-5.071	0.000	-5.798	-2.564
newborn.sexM	72.397	10.980	6.593	0.000	50.866	93.928
newborn.length.mm	48.264	3.419	14.117	0.000	41.560	54.968
newborn.length.mm.log10	-42167.049	3788.018	-11.132	0.000	-49595.038	-34739.059
newborn.skull.length.mm	10.418	0.418	24.894	0.000	9.597	11.238

5.7 Summary Visualization for Clinical Staff: Effect Sizes

Finally, we present a bar plot of the absolute standardized coefficients of the final model, which allows clinical staff to understand the relative importance of each predictor in predicting birth weight.

library(arm)
std_model <- arm::standardize(nonlinear_model)
std_coefs <- broom::tidy(std_model)
std_coefs <- std_coefs[std_coefs$term != "(Intercept)", ]
std_coefs$abs_estimate <- abs(std_coefs$estimate)
p_effect_sizes <- ggplot(std_coefs, aes(x = reorder(term, abs_estimate), y = abs_estimate, fill = term)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Effect Sizes of Key Predictors (Standardized Coefficients)",
    x = "Predictor",
    y = "Absolute Standardized Coefficient"
  )
print(p_effect_sizes)

The most important predictor for birth weight appears to be the newborn length, followed by its logarithmic transformation.

A statistical model for predicting newborn weight

Luca Zappaterra

May 19\(^{th}\), 2025

1 Note sulla precedente consegna

2 Introduction

2.1 Description of the different variables

3 Analysis and Modeling

3.1 Data Exploration and Cleaning

3.1.1 Summary Statistics

3.1.2 Check for Missing Values

3.1.3 Visualizations of Continuous Variables

3.1.3.1 Histograms of Continuous Variables

3.1.3.2 Faceted Boxplots of Continuous Variables

3.1.3.3 Faceted Violin Plots of Continuous Variables

3.1.4 Visualizations of Frequency Tables for Categorical Variables

3.2 Hypothesis Testing

3.2.1 Cesarean Delivery Rates Across Hospitals (Chi-square Test)

3.2.2 One-sample t-tests: Neonatal Weight and Length vs. Population Values

3.2.3 Two-sample t-tests: Anthropometric Differences by Sex

3.3 Multiple Linear Regression Model Development

3.3.1 Variable Preparation

3.3.2 Initial Model Fitting

3.4 Model Selection and Validation

3.4.1 Model Selection: Stepwise Regression (AIC/BIC)

3.4.2 Model Simplification

3.4.3 Exploring Interactions and Non-linear Effects

3.4.3.1 Testing Interaction Terms

3.4.3.2 Testing Non-linear Effects

3.5 Model Quality Analysis

3.5.1 Predictive Metrics: R² and RMSE

3.5.2 Residual Analysis

3.5.2.1 Histogram of residuals

3.5.2.2 Influential Observations

4 Predictions and Results

5 Data Visualization and Results Presentation

5.1 Effect of Gestational Weeks on Birth Weight

5.2 Impact of Maternal Smoking on Birth Weight

5.3 Distribution of Birth Weights Across Hospitals

5.4 Differences in Anthropometric Measurements by Gender

5.5 Model Performance: Predicted vs. Actual Birth Weights

5.6 Table of Regression Coefficients

5.7 Summary Visualization for Clinical Staff: Effect Sizes