Ciao! 😁
Mi scuso per gli errori scemi, purtroppo alcune delle parti iniziali le avevo buttate giù in bozza mesi fa e lasciate da parte causa lavoro, in fase di revisione non ho minimamente notato le incongruenze sulle variabili categoriche, sarà stato il clima pasquale ahahah.
- Ho corretto le descrizioni delle variabili nel dataset, mi sono accorto anche di qualche altra imprecisione.
- Ho modificato il grafico degli istogrammi delle variabili categoriche che effettivamente aveva troppi colori, ma ho mantenuto i colori nei boxplot e nei violinplot perché a mio avviso lì consentono di associare alle stesse varibaili il plot corretto in maniera semplice passando da un grafico all’altro.
Per quanto riguarda le problematiche di modellistica, si è vista la mia scarsa conoscenza del mondo medicale, non avevo pensato alle misurazioni dall’ecografia!
In ogni caso:
- Ho tenuto le variabili misurabili a ecografia e rimosso quelle inutili
- Ho modificato la parte model selection e ho rivisitato la parte di non-linearità, mi sono accorto di una interessante relazione logaritmica tra peso e lunghezza del neonato che dà un piccolo boost alla varianza spiegata dal modello.
- Ho mostrato anche quale fosse l’osservazione con alta distanza di Cook.
Spero ora sia tutto più chiaro e corretto!
In this project, we will analyze data regarding 2500 newborns and their mothers, to understand if it is possible to create a statistical model to predict their weight based on the collected data. These data are contained in the newborns.csv dataset.
First of all, let’s import the dataset to have a look at its structure:
newborns_data <- read.csv("newborns.csv")
attach(newborns_data)
head(newborns_data)
We can separate the dataset’s features in the following categories:
| Variables | Type | Scale | Description |
|---|---|---|---|
Mother.age |
Numerical, continuous | Ordinal | Mother’s age [in years] |
Mother.pregnancies |
Numerical, discrete | Ratio | Number of sustained pregnancies |
Mother.is.smoker |
Categorical, discrete | Nominal | If the mother is a smoker [0 = no, 1 = yes] |
Mother.gestation.weeks |
Numerical, continuous | Ratio | Time of gestation [in weeks] |
newborn.weight.g |
Numerical, continuous | Ratio | Weight of the newborn [in grams] |
newborn.length.mm |
Numerical, continuous | Ratio | Length of the newborn [in mm] |
newborn.skull.length.mm |
Numerical, continuous | Ratio | Cranial diameter of the newborn [in mm] |
newborn.birth.kind |
Categorical, discrete | Nominal | Type of birth [Nat = natural, Ces = cesarean] |
birth.hospital |
Categorical, discrete | Nominal | Hospital where the birth took place [1, 2 or 3] |
newborn.sex |
Categorical, discrete | Nominal | Newborn’s sex [M or F] |
In this section, we perform an initial exploration and cleaning of the dataset. This includes generating summary statistics, visualizing variable distributions, identifying outliers, and checking for missing values. These steps are essential to ensure the quality and reliability of subsequent statistical analyses and modeling.
To gain an initial understanding of the dataset, we compute summary statistics for all variables. This provides insights into the central tendency, spread, and potential anomalies in the data.
# Select only numeric columns for summary
numeric_cols <- setdiff(names(newborns_data), c("Mother.is.smoker", "newborn.birth.kind", "birth.hospital", "newborn.sex"))
as.data.frame(do.call(cbind, lapply(newborns_data[, numeric_cols], summary)))
Apparently, there is at least one record of a mothers with age 0. Let’s verify what is the problem with this record:
newborns_data %>% filter(Mother.age <= 5)
There is also a record of a mother with age 1. This is probably a data entry error, so we will remove it from the dataset.
newborns_data <- newborns_data %>% filter(Mother.age > 1)
It is important to identify missing values before proceeding with analysis, as they can affect the results and may require imputation or removal.
colSums(is.na(newborns_data))
## Mother.age Mother.pregnancies Mother.is.smoker
## 0 0 0
## Mother.gestation.weeks newborn.weight.g newborn.length.mm
## 0 0 0
## newborn.skull.length.mm newborn.birth.kind birth.hospital
## 0 0 0
## newborn.sex
## 0
We are lucky, there are no missing values in the dataset!
Visualizing the distributions of continuous variables helps to detect skewness, outliers, and unusual patterns. Here, we use ggplot2 for more flexible and publication-ready graphics.
We use faceted histograms to display the distribution of each continuous variable.
# Gather continuous variables into long format
gathered <- newborns_data %>%
select(Mother.age, Mother.gestation.weeks, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(gathered, aes(x = Value)) +
geom_histogram(bins = 20, fill = "skyblue", color = "white") +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Histograms of Continuous Variables")
The histograms above show that most continuous variables are approximately normally distributed, with some skewness in variables such as the number of pregnancies (right-skewed) and gestational weeks (slightly left-skewed). The distributions for newborn length, skull length, and weight are fairly symmetric, while mother’s age is centered around the late 20s to early 30s. There are no obvious extreme outliers, but the number of pregnancies has a long tail, indicating a few mothers with a high number of pregnancies.
To better visualize the spread and outliers for each variable, we plot separate boxplots for each continuous variable using facets.
ggplot(gathered, aes(y = Value, x = "", fill = Variable)) +
geom_boxplot(show.legend = FALSE, outlier.color = "red") +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Faceted Boxplots of Continuous Variables", x = "", y = "Value")
The faceted boxplots above provide a clearer view of the distribution and outliers for each continuous variable:
Overall, the boxplots highlight the presence of outliers in all variables, especially in pregnancies and birth weight. These should be considered in subsequent analyses, as they may influence statistical modeling and inference.
Violin plots provide additional information about the distribution shape and density for each variable, complementing the boxplots.
# Gather continuous variables into long format for violin plots
gathered_violin <- newborns_data %>%
select(Mother.age, Mother.gestation.weeks, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>% # Excluded Mother.pregnancies
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(gathered_violin, aes(y = Value, x = "", fill = Variable)) + # Removed fill from aes
geom_violin(show.legend = FALSE, color = "black") + # Added static fill
facet_wrap(~ Variable, scales = "free", ncol = 3) +
theme_minimal() +
labs(title = "Faceted Violin Plots of Continuous Variables", x = "", y = "Value")
From these violin plots, we can draw several additional insights:
Violin plots complement the boxplots and summarize both boxplots and histogram conclusion in a single plot, providing also a richer view of the data’s distribution, helping to confirm the presence of skewness, multimodality, and outliers that may influence further statistical analysis.
To better understand the distribution of categorical variables, we visualize their frequencies using bar plots.
cat_vars <- c("Mother.is.smoker", "newborn.birth.kind", "birth.hospital", "newborn.sex")
cat_data <- newborns_data %>%
mutate(across(all_of(cat_vars), as.character)) %>%
select(all_of(cat_vars)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category")
ggplot(cat_data, aes(x = as.factor(Category))) +
geom_bar(fill = "skyblue", show.legend = FALSE) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.3, size = 3) +
facet_wrap(~ Variable, scales = "free_x") +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_minimal() +
labs(title = "Bar Plots of Categorical Variables", x = "Category", y = "Count")
The bar plots above show the distribution of categorical variables:
These plots help confirm the representativeness and balance of the sample, and highlight the low prevalence of maternal smoking, which may be important for subsequent analysis.
In this section, we conduct statistical hypothesis tests to investigate key research questions about the neonatal dataset. Each test includes a clear statement of hypotheses, selection of the appropriate test statistic, calculation of the p-value, interpretation of results, and supporting visualizations.
Research Question: Are cesarean delivery rates independent of the hospital where the birth took place?
# Chi-square test: Cesarean delivery rates by hospital
# Create contingency table
cesarean_table <- table(newborns_data$newborn.birth.kind, newborns_data$birth.hospital)
# Only consider 'Ces' vs. 'Nat' if both present
# Perform chi-square test
chisq_result <- chisq.test(cesarean_table)
chisq_result
##
## Pearson's Chi-squared test
##
## data: cesarean_table
## X-squared = 1.1, df = 2, p-value = 0.6
# Bar plot of cesarean rates by hospital
cesarean_prop <- prop.table(cesarean_table, 2)
barplot(cesarean_prop, beside = TRUE, legend = TRUE,
col = c("skyblue", "salmon"),
main = "Cesarean vs. Natural Births by Hospital",
xlab = "Hospital", ylab = "Proportion")
Interpretation:
The chi-squared test for independence between hospital and cesarean delivery rates yields a p-value of 0.6, which is much greater than the conventional significance level of 0.05. Therefore, we do not reject the null hypothesis: there is no statistically significant evidence that cesarean delivery rates differ across the three hospitals in our sample.
The bar plot visually confirms this result, showing that the proportions of cesarean and natural births are very similar across all hospitals. There are no substantial differences in cesarean rates between hospitals, supporting the conclusion of the statistical test.
Research Question: Are the mean weight and length of neonates in our sample significantly different from known population values?
According to the World Health Organization, the average weight of a newborn is ~3250 grams taking into account both boys (3300 g) and girls (3200 g), while the average length is ~495 mm (500 mm for boys, 490 mm for girls).
pop_mean_weight <- 3250
pop_mean_length <- 495
# Weight
ttest_weight <- t.test(
newborns_data$newborn.weight.g,
mu = pop_mean_weight,
conf.level = 0.95,
alternative = "two.sided"
)
ttest_weight
##
## One Sample t-test
##
## data: newborns_data$newborn.weight.g
## t = 3.3, df = 2497, p-value = 0.001
## alternative hypothesis: true mean is not equal to 3250
## 95 percent confidence interval:
## 3264 3305
## sample estimates:
## mean of x
## 3284
# Length
ttest_length <- t.test(
newborns_data$newborn.length.mm,
mu = pop_mean_length,
conf.level = 0.95,
alternative = "two.sided"
)
ttest_length
##
## One Sample t-test
##
## data: newborns_data$newborn.length.mm
## t = -0.58, df = 2497, p-value = 0.6
## alternative hypothesis: true mean is not equal to 495
## 95 percent confidence interval:
## 493.7 495.7
## sample estimates:
## mean of x
## 494.7
# Prepare data for ggplot
plot_data <- newborns_data %>%
select(newborn.weight.g, newborn.length.mm) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Set means for lines
means <- data.frame(
Variable = c("newborn.weight.g", "newborn.length.mm"),
pop_mean = c(pop_mean_weight, pop_mean_length),
sample_mean = c(mean(newborns_data$newborn.weight.g), mean(newborns_data$newborn.length.mm))
)
# Plot
p <- ggplot(plot_data, aes(x = Value, fill = Variable)) +
geom_histogram(bins = 20, color = "black", alpha = 0.7) +
facet_wrap(~ Variable, scales = "free", labeller = as_labeller(
c("newborn.weight.g" = "Newborn Weight (g)", "newborn.length.mm" = "Newborn Length (mm)")
)) +
geom_vline(data = means, aes(xintercept = pop_mean, color = "Population Mean"), linetype = "dashed", size = 1) +
geom_vline(data = means, aes(xintercept = sample_mean, color = "Sample Mean"), linetype = "solid", size = 1) +
scale_color_manual(name = NULL, values = c("Population Mean" = "red", "Sample Mean" = "black")) +
scale_fill_manual(values = c("newborn.weight.g" = "skyblue", "newborn.length.mm" = "salmon"), guide = "none") +
theme_minimal() +
theme(legend.position = "bottom") +
labs(x = NULL, y = "Frequency")
print(p)
Interpretation:
For weight, the one-sample t-test yields a p-value of 0.001, which is less than the conventional significance level of 0.05. Therefore, we reject the null hypothesis: the mean weight of newborns in our sample (3284g, 95% CI: 3264–3305g) is significantly different from the population mean, and specifically, it is higher.
For length, the one-sample t-test yields a p-value of 0.6, much greater than 0.05. Therefore, we do not reject the null hypothesis: the mean length of newborns in our sample (494.7mm, 95% CI: 493.7–495.7mm) is not significantly different from the population mean.
The histograms visually confirm these results: the sample mean (solid black line) for weight is shifted to the right of the population mean (dashed red line), while the sample mean for length is almost perfectly aligned with the population mean, indicating strong agreement with the reference value.
Research Question: Do anthropometric measurements (weight, length, skull length) differ significantly between male and female neonates?
# Two-sample t-tests: Anthropometric differences by sex
# Ensure sex is a factor
newborns_data$newborn.sex <- as.factor(newborns_data$newborn.sex)
# Weight
ttest_sex_weight <- t.test(newborn.weight.g ~ newborn.sex, data=newborns_data)
ttest_sex_weight
##
## Welch Two Sample t-test
##
## data: newborn.weight.g by newborn.sex
## t = -12, df = 2489, p-value <0.0000000000000002
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.5 -207.4
## sample estimates:
## mean in group F mean in group M
## 3161 3408
# Visualization
boxplot(newborn.weight.g ~ newborn.sex, data=newborns_data, main="Weight by Sex", col=c("skyblue","salmon"), ylab="Weight (g)")
# Length
ttest_sex_length <- t.test(newborn.length.mm ~ newborn.sex, data=newborns_data)
ttest_sex_length
##
## Welch Two Sample t-test
##
## data: newborn.length.mm by newborn.sex
## t = -9.6, df = 2457, p-value <0.0000000000000002
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.939 -7.883
## sample estimates:
## mean in group F mean in group M
## 489.8 499.7
# Visualization
boxplot(newborn.length.mm ~ newborn.sex, data=newborns_data, main="Length by Sex", col=c("skyblue","salmon"), ylab="Length (mm)")
# Skull length
ttest_sex_skull <- t.test(newborn.skull.length.mm ~ newborn.sex, data=newborns_data)
ttest_sex_skull
##
## Welch Two Sample t-test
##
## data: newborn.skull.length.mm by newborn.sex
## t = -7.4, df = 2489, p-value = 0.0000000000001
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -6.111 -3.560
## sample estimates:
## mean in group F mean in group M
## 337.6 342.5
# Visualization
boxplot(newborn.skull.length.mm ~ newborn.sex, data=newborns_data, main="Skull Length by Sex", col=c("skyblue","salmon"), ylab="Skull Length (mm)")
Interpretation:
For all three anthropometric measures—weight, length, and skull length—the two-sample t-tests yield extremely small p-values (all p < 0.0001), much less than the conventional significance level of 0.05. Therefore, we reject the null hypothesis in each case: there are statistically significant differences between male and female newborns for all these measurements.
These results, visually confirmed by the boxplots, indicate that, in this sample, male newborns are on average heavier, longer, and have greater skull length than female newborns, and these differences are highly statistically significant.
In this subsection, we develop a multiple linear regression model to predict neonatal birth weight based on the available clinical and demographic variables. This step builds on the cleaned and explored dataset from previous tasks, and leverages the insights gained from hypothesis testing.
Before fitting the model, we ensure that all variables are appropriately coded. Categorical variables are converted to factors, and we check for the need for any transformations or interaction terms. The main predictors of interest include maternal smoking status, gestational weeks, and other relevant variables identified in previous analyses.
# Convert categorical variables to factors
newborns_data$Mother.is.smoker <- as.factor(newborns_data$Mother.is.smoker)
newborns_data$newborn.birth.kind <- as.factor(newborns_data$newborn.birth.kind)
newborns_data$birth.hospital <- as.factor(newborns_data$birth.hospital)
newborns_data$newborn.sex <- as.factor(newborns_data$newborn.sex)
# Check levels
lapply(newborns_data[, c('Mother.is.smoker', 'newborn.birth.kind', 'birth.hospital', 'newborn.sex')], levels)
## $Mother.is.smoker
## [1] "0" "1"
##
## $newborn.birth.kind
## [1] "Ces" "Nat"
##
## $birth.hospital
## [1] "osp1" "osp2" "osp3"
##
## $newborn.sex
## [1] "F" "M"
We begin by fitting a multiple linear regression model with all relevant predictors. The response variable is neonatal birth weight (in grams). We include maternal age, number of pregnancies, gestational weeks, maternal smoking status, birth kind, hospital, and newborn sex as predictors. We will later assess the significance of each variable and consider possible interactions or non-linear effects.
Note that we are not including all the variables in the model, since some of them (e.g. newborn.birth.kind or birth.hospital) will not be available for prediction while newborns are still in the womb or clearly unrelated to the newborn’s weight.
# Fit initial multiple linear regression model
initial_model <- lm(newborn.weight.g ~ Mother.age + Mother.pregnancies +
Mother.gestation.weeks + Mother.is.smoker +
newborn.sex + newborn.length.mm +
newborn.skull.length.mm,
data = newborns_data)
summary(initial_model)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.age + Mother.pregnancies +
## Mother.gestation.weeks + Mother.is.smoker + newborn.sex +
## newborn.length.mm + newborn.skull.length.mm, data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.6 -181.3 -15.7 163.6 2630.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6712.240 141.334 -47.49 < 0.0000000000000002 ***
## Mother.age 0.880 1.149 0.77 0.444
## Mother.pregnancies 11.379 4.677 2.43 0.015 *
## Mother.gestation.weeks 32.947 3.829 8.61 < 0.0000000000000002 ***
## Mother.is.smoker1 -30.396 27.608 -1.10 0.271
## newborn.sexM 78.079 11.213 6.96 0.0000000000042 ***
## newborn.length.mm 10.232 0.301 33.98 < 0.0000000000000002 ***
## newborn.skull.length.mm 10.520 0.427 24.63 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2490 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.726
## F-statistic: 948 on 7 and 2490 DF, p-value: <0.0000000000000002
Interpretation of the Initial Regression Model:
The initial multiple linear regression model, now including neonatal length and skull measurements and excluding birth type and hospital, provides the following insights:
newborn.length.mm: Each additional millimeter in newborn length is associated with an increase of, on average, 10.23 grams in birth weight (p < 2e-16).newborn.skull.length.mm: Each additional millimeter in newborn skull length is associated with an increase of, on average, 10.52 grams in birth weight (p < 2e-16).Mother.gestation.weeks: Each additional week of gestation is associated with an increase of, on average, 32.95 grams in birth weight (p < 2e-16).newborn.sexM: Male newborns are, on average, 78.08 grams heavier than female newborns (p ≈ 4.2e-12).Mother.pregnancies: Each additional pregnancy is associated with an increase of, on average, 11.38 grams in birth weight (p = 0.015).Mother.age: Maternal age does not show a statistically significant association with birth weight in this model (p = 0.444).Mother.is.smoker1: Maternal smoking status does not show a statistically significant association with birth weight in this model (p = 0.271), when accounting for other factors like newborn measurements.newborn.sex, non-smoker for Mother.is.smoker). In this context, an intercept far from a plausible birth weight is common and primarily serves to adjust the model’s baseline.Next steps:
In this subsection, we focus on optimizing the multiple linear regression model for neonatal birth weight prediction. The goal is to select the most parsimonious model using statistical criteria, validate its assumptions, and ensure robust predictive performance.
We use stepwise regression based on AIC (Akaike Information Criterion) to identify the optimal set of predictors. This process iteratively adds or removes variables to minimize the AIC, balancing model fit and complexity. We also compare with BIC (Bayesian Information Criterion) for additional rigor.
# Stepwise selection using AIC
step_model_aic <- step(initial_model, direction = "both", trace = FALSE)
summary(step_model_aic)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
## newborn.sex + newborn.length.mm + newborn.skull.length.mm,
## data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.4 -181.0 -15.6 163.7 2639.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.725 135.804 -49.20 < 0.0000000000000002 ***
## Mother.pregnancies 12.455 4.342 2.87 0.0042 **
## Mother.gestation.weeks 32.383 3.801 8.52 < 0.0000000000000002 ***
## newborn.sexM 77.981 11.211 6.96 0.0000000000045 ***
## newborn.length.mm 10.245 0.301 34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm 10.541 0.426 24.72 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.726
## F-statistic: 1.33e+03 on 5 and 2492 DF, p-value: <0.0000000000000002
# Stepwise selection using BIC
n <- nrow(newborns_data)
step_model_bic <- step(initial_model, direction = "both", k = log(n), trace = FALSE)
summary(step_model_bic)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
## newborn.sex + newborn.length.mm + newborn.skull.length.mm,
## data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.4 -181.0 -15.6 163.7 2639.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.725 135.804 -49.20 < 0.0000000000000002 ***
## Mother.pregnancies 12.455 4.342 2.87 0.0042 **
## Mother.gestation.weeks 32.383 3.801 8.52 < 0.0000000000000002 ***
## newborn.sexM 77.981 11.211 6.96 0.0000000000045 ***
## newborn.length.mm 10.245 0.301 34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm 10.541 0.426 24.72 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.726
## F-statistic: 1.33e+03 on 5 and 2492 DF, p-value: <0.0000000000000002
Both techniques converged to the same model, where Mother.age and Mother.is.smoker1 are removed. This is not surprising, since they were the least significant predictors in the initial model.
Based on the stepwise selection results, we remove non-significant predictors (such as birth kind and hospital, if not retained) and refit the model. We compare the simplified model to the initial model using AIC, BIC, and adjusted R-squared.
# The model selected by both AIC and BIC stepwise regression
simplified_model <- lm(newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
newborn.sex + newborn.length.mm + newborn.skull.length.mm,
data = newborns_data)
summary(simplified_model)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
## newborn.sex + newborn.length.mm + newborn.skull.length.mm,
## data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.4 -181.0 -15.6 163.7 2639.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.725 135.804 -49.20 < 0.0000000000000002 ***
## Mother.pregnancies 12.455 4.342 2.87 0.0042 **
## Mother.gestation.weeks 32.383 3.801 8.52 < 0.0000000000000002 ***
## newborn.sexM 77.981 11.211 6.96 0.0000000000045 ***
## newborn.length.mm 10.245 0.301 34.06 < 0.0000000000000002 ***
## newborn.skull.length.mm 10.541 0.426 24.72 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.726
## F-statistic: 1.33e+03 on 5 and 2492 DF, p-value: <0.0000000000000002
AIC(initial_model, step_model_aic, step_model_bic, simplified_model)
BIC(initial_model, step_model_aic, step_model_bic, simplified_model)
Model Simplification Results and Interpretation:
The stepwise regression procedures using both AIC and BIC yielded the identical, more parsimonious model:
Selected Model (AIC and BIC):
Mother.age and Mother.is.smoker from the initial model.Mother.pregnancies: Each additional pregnancy is associated with an increase of, on average, 12.46 grams in birth weight (p = 0.0042).Mother.gestation.weeks: Each additional week of gestation is associated with an increase of, on average, 32.38 grams in birth weight (p < 2e-16).newborn.sexM: Male newborns are, on average, 77.98 grams heavier than female newborns (p ≈ 4.5e-12).newborn.length.mm: Each additional millimeter in newborn length is associated with an increase of, on average, 10.25 grams in birth weight (p < 2e-16).newborn.skull.length.mm: Each additional millimeter in newborn skull length is associated with an increase of, on average, 10.54 grams in birth weight (p < 2e-16).Model Comparison and Considerations:
Mother.age and Mother.is.smoker. This suggests these variables do not significantly improve the model fit once other strong predictors like newborn measurements (length, skull diameter) and gestational age are included.initial_model that included the two now-excluded variables.This simplified model, simplified_model, will now be used for exploring potential interaction terms and non-linear effects.
We now extend the model selection process to consider possible interaction terms and non-linear effects. This allows us to capture more complex relationships between predictors and neonatal birth weight, potentially improving model fit and interpretability.
We test interactions that are clinically plausible, such as:
Maternal smoking × gestational weeksSex × gestational weeks# Add interaction terms to the AIC-selected model
interaction_model <- lm(newborn.weight.g ~ Mother.age +
Mother.pregnancies +
Mother.gestation.weeks +
Mother.is.smoker +
newborn.sex +
newborn.length.mm +
newborn.skull.length.mm +
Mother.gestation.weeks * Mother.is.smoker +
newborn.sex * Mother.gestation.weeks,
data = newborns_data)
summary(interaction_model)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.age + Mother.pregnancies +
## Mother.gestation.weeks + Mother.is.smoker + newborn.sex +
## newborn.length.mm + newborn.skull.length.mm + Mother.gestation.weeks *
## Mother.is.smoker + newborn.sex * Mother.gestation.weeks,
## data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1155.2 -181.0 -15.8 162.5 2629.1
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -6615.822 172.034 -38.46
## Mother.age 0.869 1.149 0.76
## Mother.pregnancies 11.401 4.677 2.44
## Mother.gestation.weeks 30.516 4.629 6.59
## Mother.is.smoker1 795.043 757.696 1.05
## newborn.sexM -194.817 234.995 -0.83
## newborn.length.mm 10.230 0.301 33.97
## newborn.skull.length.mm 10.516 0.427 24.62
## Mother.gestation.weeks:Mother.is.smoker1 -21.049 19.288 -1.09
## Mother.gestation.weeks:newborn.sexM 7.008 6.015 1.17
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Mother.age 0.450
## Mother.pregnancies 0.015 *
## Mother.gestation.weeks 0.000000000053 ***
## Mother.is.smoker1 0.294
## newborn.sexM 0.407
## newborn.length.mm < 0.0000000000000002 ***
## newborn.skull.length.mm < 0.0000000000000002 ***
## Mother.gestation.weeks:Mother.is.smoker1 0.275
## Mother.gestation.weeks:newborn.sexM 0.244
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2488 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.727
## F-statistic: 738 on 9 and 2488 DF, p-value: <0.0000000000000002
AIC(interaction_model)
## [1] 35157
BIC(interaction_model)
## [1] 35221
Interpretation of the Interaction Model:
We also test for non-linear effects, such as a quadratic term for gestational weeks, to see if the relationship with birth weight is better captured by a curve.
First, we plot the correlation matrix of the simplified model variables to check for evidence of non-linear relationships.
# Correlation plot for selected variables (improved readability)
library(GGally)
corr_vars <- newborns_data %>%
select(
"Mother's Age" = Mother.age,
"Pregnancies" = Mother.pregnancies,
"Gest. Weeks" = Mother.gestation.weeks,
"Weight (g)" = newborn.weight.g,
"Length (mm)" = newborn.length.mm,
"Skull L.(mm)" = newborn.skull.length.mm
)
# Custom theme for better readability
custom_theme <- theme_minimal(base_size = 12) +
theme(
strip.text = element_text(size = 8, face = "bold"),
axis.text = element_text(size = 5),
axis.title = element_text(size = 8)
)
# Plot correlation matrix with improved labels and theme
p_corr <- ggpairs(
corr_vars,
lower = list(continuous = wrap("smooth", alpha = 0.3, size = 0.7, color = "skyblue")),
diag = list(continuous = wrap("barDiag", fill = "skyblue", color = "grey40")),
upper = list(continuous = wrap("cor", size = 5, color = "black", family = "mono"))
) +
custom_theme
print(p_corr)
By eye, we can see that length, skull length and gestational weeks seems to have a non-linear relationship with birth weight.
After some tests, the best model is obtained by adding a quadratic term in gestational weeks and a logarithmic term in newborn length.
# Add quadratic term for gestational weeks
newborns_data$Mother.gestation.weeks2 <- newborns_data$Mother.gestation.weeks^2
newborns_data$newborn.length.mm.log10 <- log10(newborns_data$newborn.length.mm)
nonlinear_model <- lm(newborn.weight.g ~ Mother.pregnancies +
Mother.gestation.weeks +
Mother.gestation.weeks2 +
newborn.sex +
newborn.length.mm +
newborn.length.mm.log10 +
newborn.skull.length.mm,
data = newborns_data)
summary(nonlinear_model)
##
## Call:
## lm(formula = newborn.weight.g ~ Mother.pregnancies + Mother.gestation.weeks +
## Mother.gestation.weeks2 + newborn.sex + newborn.length.mm +
## newborn.length.mm.log10 + newborn.skull.length.mm, data = newborns_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1192.6 -183.3 -12.8 163.7 1323.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81754.545 7811.154 10.47 < 0.0000000000000002 ***
## Mother.pregnancies 14.578 4.239 3.44 0.00059 ***
## Mother.gestation.weeks 359.501 62.652 5.74 0.000000010739 ***
## Mother.gestation.weeks2 -4.181 0.824 -5.07 0.000000424361 ***
## newborn.sexM 72.397 10.980 6.59 0.000000000052 ***
## newborn.length.mm 48.264 3.419 14.12 < 0.0000000000000002 ***
## newborn.length.mm.log10 -42167.049 3788.018 -11.13 < 0.0000000000000002 ***
## newborn.skull.length.mm 10.418 0.418 24.89 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 268 on 2490 degrees of freedom
## Multiple R-squared: 0.741, Adjusted R-squared: 0.74
## F-statistic: 1.02e+03 on 7 and 2490 DF, p-value: <0.0000000000000002
AIC(nonlinear_model)
## [1] 35030
BIC(nonlinear_model)
## [1] 35083
Interpretation of the Non-linear Model:
Mother.gestation.weeks2) is highly significant (p < 0.001) with a negative coefficient (-4.18), indicating a concave (downward) relationship: birth weight increases with gestational age, but the rate of increase slows at higher gestational weeks.newborn.length.mm.log10) is highly significant (p < 0.001) with a large negative coefficient (-42,167), while the linear term for newborn length (newborn.length.mm) is also highly significant and positive (coefficient = 48.26, p < 0.001). This combination suggests a non-linear effect: birth weight increases with newborn length, but the relationship is not strictly linear.Conclusion:
nonlinear_model will be our model of choice.We now assess the predictive quality of the nonlinear_model and check for potential issues that could affect inference or prediction.
# R-squared is available from summary(nonlinear_model)
rsq <- summary(nonlinear_model)$adj.r.squared
# RMSE calculation
rmse <- sqrt(mean(nonlinear_model$residuals^2))
cat("R-squared:", rsq, "\n")
## R-squared: 0.7398
cat("RMSE:", rmse, "\n")
## RMSE: 267.5
Interpretation of R² and RMSE:
We now examine the residuals to check for normality, homoscedasticity, and any patterns that might indicate model misspecification.
# Histogram of residuals
hist(nonlinear_model$residuals, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals")
abline(v = mean(nonlinear_model$residuals), col = "red")
# QQ-plot
qqnorm(nonlinear_model$residuals)
qqline(nonlinear_model$residuals)
# Residuals vs Fitted
plot(nonlinear_model$fitted.values, nonlinear_model$residuals,
xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, col = "red")
# Shapiro-Wilk test for normality
shapiro.test(sample(nonlinear_model$residuals, min(5000, length(nonlinear_model$residuals))))
##
## Shapiro-Wilk normality test
##
## data: sample(nonlinear_model$residuals, min(5000, length(nonlinear_model$residuals)))
## W = 0.99, p-value = 0.000000000002
# Breusch-Pagan test for homoscedasticity
bptest(nonlinear_model)
##
## studentized Breusch-Pagan test
##
## data: nonlinear_model
## BP = 71, df = 7, p-value = 0.000000000001
Interpretation of Residual Analysis:
Conclusion:
We identify influential points using Cook’s distance and leverage, as these can disproportionately affect model estimates.
# Cook's distance
cooksd <- cooks.distance(nonlinear_model)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's distance")
abline(h = 4 / length(cooksd), col = "red", lty = 2)
# Leverage
lev <- hatvalues(nonlinear_model)
plot(lev, type = "h", main = "Leverage", ylab = "Leverage")
abline(h = 2 * mean(lev), col = "red", lty = 2)
Interpretation of Cook’s Distance and Leverage:
cat("This point corresponds to the", which(cooksd > 1.2), "th observation")
## This point corresponds to the 1549 th observation
# Show the 1549th observation in the dataset
t(newborns_data[1549, ])
## 1549
## Mother.age "35"
## Mother.pregnancies "1"
## Mother.is.smoker "0"
## Mother.gestation.weeks "38"
## newborn.weight.g "4370"
## newborn.length.mm "315"
## newborn.skull.length.mm "374"
## newborn.birth.kind "Nat"
## birth.hospital "osp3"
## newborn.sex "F"
## Mother.gestation.weeks2 "1444"
## newborn.length.mm.log10 "2.498"# Examine the influential observation (1549th) in detail
influential_obs <- newborns_data[1549, ]
influential_obs
# Select relevant variables and reshape to long format
boxplot_vars <- newborns_data %>%
select(
"Weight (g)" = newborn.weight.g,
"Length (mm)" = newborn.length.mm,
"Skull Length (mm)" = newborn.skull.length.mm
) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Calculate medians for annotation
medians_df <- boxplot_vars %>%
group_by(Variable) %>%
summarise(median_value = median(Value, na.rm = TRUE))
# Influential observation values for each variable
influential_long <- tibble(
Variable = c("Weight (g)", "Length (mm)", "Skull Length (mm)"),
inf_value = c(
influential_obs$newborn.weight.g,
influential_obs$newborn.length.mm,
influential_obs$newborn.skull.length.mm
)
)
# For annotation, get the y position for the influential obs (slightly above the triangle)
influential_long <- influential_long %>%
left_join(medians_df, by = "Variable")
# Plot all boxplots side by side with influential obs and annotated medians/values
ggplot(boxplot_vars, aes(x = "", y = Value)) +
geom_boxplot(outlier.color = "grey40", fill = "skyblue", width = 0.5) +
# Add median value annotation
geom_text(
data = medians_df,
aes(x = 1.27, y = median_value, label = paste0(round(median_value, 1))),
color = "black", hjust = 0, size = 3.5, fontface = "bold"
) +
# Add influential obs triangle and annotation
geom_point(
data = influential_long,
aes(x = "", y = inf_value, shape = "Influential Obs"),
color = "red", size = 3, show.legend = TRUE
) +
geom_text(
data = influential_long,
aes(x = 1.25, y = inf_value, label = paste0(round(inf_value, 1))),
color = "red", hjust = 0, size = 3.5, fontface = "bold"
) +
scale_shape_manual(
name = NULL,
values = c("Influential Obs" = 17),
labels = c("Observation 1549")
) +
facet_wrap(~ Variable, scales = "free_y", nrow = 1) +
labs(
title = "Boxplots of Weight, Length, and Skull Length\nwith Observation 1549",
x = "",
y = ""
) +
theme_minimal() +
theme(
strip.text = element_text(size = 12),
plot.title = element_text(size = 13, face = "bold", hjust = 0.5),
legend.position = "bottom"
)
Interpretation of the Boxplots:
Conclusion:
In this section, we will use the final model to predict the weight of a newborn using the final model developed. For instance, we can estimate the weight of a newborn girl whose mother is in her third pregnancy and will give birth at the 39th week of gestation.
# Get typical values for unspecified variables
# Median for newborn length and skull length
typical_length <- median(newborns_data$newborn.length.mm, na.rm = TRUE)
typical_skull_length <- median(newborns_data$newborn.skull.length.mm, na.rm = TRUE)
# Calculate log10 of typical length
typical_length_log10 <- log10(typical_length)
# Example dataframe with updated variables
example_df <- data.frame(
Mother.pregnancies = 3,
Mother.gestation.weeks = 39,
Mother.gestation.weeks2 = 39^2,
newborn.sex = as.factor("F"),
newborn.length.mm = typical_length,
newborn.length.mm.log10 = typical_length_log10,
newborn.skull.length.mm = typical_skull_length
)
# Prediction using the updated model
predicted_weight <- predict(
nonlinear_model,
newdata = example_df,
interval = "predict"
)
# Show simulated data and prediction
example_df
prediction_str <- paste0(
"Predicted weight: ", round(predicted_weight[1], 1),
" with 95% confidence interval: ", round(predicted_weight[2], 1), " ÷ ", round(predicted_weight[3], 1)
)
print(prediction_str)
## [1] "Predicted weight: 3325.5 with 95% confidence interval: 2799.6 ÷ 3851.4"
In this way, the model can be used to provide personalized estimates of newborn weight based on the available clinical and demographic data before birth.
In this case, the prediction is 3326 grams, with upper and lower bounds of 2800 and 3851 grams, respectively at 95% confidence level.
This means that we are 95% confident that the true weight of the newborn will be \(\mathbf{m_{pred,\,95\%} = 3326 \pm 526}\) g
To illustrate the relationship between gestational weeks and birth weight, we present a scatter plot of the data with a quadratic regression line, as identified in the final model here.
# Scatter plot with quadratic fit
p_gw_bw <- ggplot(newborns_data, aes(x = Mother.gestation.weeks, y = newborn.weight.g)) +
geom_point(alpha = 0.3, color = 'skyblue') +
stat_smooth(method = 'lm', formula = y ~ x + I(x^2), color = 'red', se = TRUE, size = 1.2) +
theme_minimal() +
labs(
title = 'Effect of Gestational Weeks on Birth Weight',
x = 'Gestational Weeks',
y = 'Birth Weight (g)'
)
print(p_gw_bw)
Interpretation:
p_smoke_bw <- ggplot(newborns_data, aes(x = Mother.is.smoker, y = newborn.weight.g, fill = Mother.is.smoker)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("0" = "skyblue", "1" = "salmon")) +
theme_minimal() +
labs(
title = "Impact of Maternal Smoking on Birth Weight",
x = "Maternal Smoking Status",
y = "Birth Weight (g)"
)
print(p_smoke_bw)
Interpretation:
p_hosp_bw <- ggplot(newborns_data, aes(x = birth.hospital, y = newborn.weight.g, fill = birth.hospital)) +
geom_boxplot(alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
theme_minimal() +
labs(
title = "Distribution of Birth Weights Across Hospitals",
x = "Hospital",
y = "Birth Weight (g)"
)
print(p_hosp_bw)
Interpretation:
long_anthro <- newborns_data %>%
select(newborn.sex, newborn.weight.g, newborn.length.mm, newborn.skull.length.mm) %>%
pivot_longer(cols = -newborn.sex, names_to = "Measurement", values_to = "Value")
p_anthro_gender <- ggplot(long_anthro, aes(x = newborn.sex, y = Value, fill = newborn.sex)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ Measurement, scales = "free_y", labeller = as_labeller(
c("newborn.weight.g" = "Weight (g)", "newborn.length.mm" = "Length (mm)", "newborn.skull.length.mm" = "Skull Length (mm)")
)) +
scale_fill_manual(values = c("F" = "salmon", "M" = "skyblue")) +
theme_minimal() +
labs(
title = "Anthropometric Measurements by Gender",
x = "Newborn Sex",
y = "Value"
)
print(p_anthro_gender)
Interpretation:
length, skull length, and weight), male newborns have higher median values than females.weight and length, but is also evident for skull length.preds <- data.frame(
Actual = newborns_data$newborn.weight.g,
Predicted = predict(nonlinear_model, newdata = newborns_data)
)
p_perf <- ggplot(preds, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3, color = "purple") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
theme_minimal() +
labs(
title = "Model Performance: Predicted vs. Actual Birth Weights",
x = "Actual Birth Weight (g)",
y = "Predicted Birth Weight (g)"
)
print(p_perf)
Interpretation:
Here we present the coefficients of the final model:
coef_table <- tidy(nonlinear_model, conf.int = TRUE)
kable(coef_table, digits = 3, caption = "Regression Coefficients for Final Model")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 81754.545 | 7811.154 | 10.466 | 0.000 | 66437.519 | 97071.571 |
| Mother.pregnancies | 14.578 | 4.239 | 3.439 | 0.001 | 6.266 | 22.890 |
| Mother.gestation.weeks | 359.501 | 62.652 | 5.738 | 0.000 | 236.645 | 482.357 |
| Mother.gestation.weeks2 | -4.181 | 0.824 | -5.071 | 0.000 | -5.798 | -2.564 |
| newborn.sexM | 72.397 | 10.980 | 6.593 | 0.000 | 50.866 | 93.928 |
| newborn.length.mm | 48.264 | 3.419 | 14.117 | 0.000 | 41.560 | 54.968 |
| newborn.length.mm.log10 | -42167.049 | 3788.018 | -11.132 | 0.000 | -49595.038 | -34739.059 |
| newborn.skull.length.mm | 10.418 | 0.418 | 24.894 | 0.000 | 9.597 | 11.238 |
Finally, we present a bar plot of the absolute standardized coefficients of the final model, which allows clinical staff to understand the relative importance of each predictor in predicting birth weight.
library(arm)
std_model <- arm::standardize(nonlinear_model)
std_coefs <- broom::tidy(std_model)
std_coefs <- std_coefs[std_coefs$term != "(Intercept)", ]
std_coefs$abs_estimate <- abs(std_coefs$estimate)
p_effect_sizes <- ggplot(std_coefs, aes(x = reorder(term, abs_estimate), y = abs_estimate, fill = term)) +
geom_col(show.legend = FALSE) +
coord_flip() +
theme_minimal() +
labs(
title = "Effect Sizes of Key Predictors (Standardized Coefficients)",
x = "Predictor",
y = "Absolute Standardized Coefficient"
)
print(p_effect_sizes)
The most important predictor for birth weight appears to be the newborn length, followed by its logarithmic transformation.