The project is part of a context of growing attention to the prevention of neonatal complications. The ability to predict the birth weight of newborns represents a key opportunity to improve clinical planning and reduce the risks associated with problematic births, such as premature births or low-birth weight infants.
library(dplyr)
library(lmtest)
library(ggplot2)
library(MASS)
library(ggeffects)
library(moments)
library(knitr)
library(kableExtra)
I begin with an exploration of the dataset:
setwd("~/File eseguibili R/dataset")
data = read.csv("neonati.csv", sep = ',',stringsAsFactors = T)
summary(data)
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## Peso Lunghezza Cranio Tipo.parto Ospedale Sesso
## Min. : 830 Min. :310.0 Min. :235 Ces: 728 osp1:816 F:1256
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Nat:1772 osp2:849 M:1244
## Median :3300 Median :500.0 Median :340 osp3:835
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
head(data)
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1 26 0 0 42 3380 490 325 Nat
## 2 21 2 0 39 3150 490 345 Nat
## 3 34 3 0 38 3640 500 375 Nat
## 4 28 1 0 41 3690 515 365 Nat
## 5 20 0 0 38 3700 480 335 Nat
## 6 32 0 0 40 3200 495 340 Nat
## Ospedale Sesso
## 1 osp3 M
## 2 osp1 F
## 3 osp2 M
## 4 osp2 M
## 5 osp3 F
## 6 osp2 F
str(data)
## 'data.frame': 2500 obs. of 10 variables:
## $ Anni.madre : int 26 21 34 28 20 32 26 25 22 23 ...
## $ N.gravidanze: int 0 2 3 1 0 0 1 0 1 0 ...
## $ Fumatrici : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gestazione : int 42 39 38 41 38 40 39 40 40 41 ...
## $ Peso : int 3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
## $ Lunghezza : int 490 490 500 515 480 495 480 510 500 510 ...
## $ Cranio : int 325 345 375 365 335 340 345 349 335 362 ...
## $ Tipo.parto : Factor w/ 2 levels "Ces","Nat": 2 2 2 2 2 2 2 2 1 1 ...
## $ Ospedale : Factor w/ 3 levels "osp1","osp2",..: 3 1 2 2 3 2 3 1 2 2 ...
## $ Sesso : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
Let’s take a look at the density and boxplot of the variables, (just the numerical variables)
attach(data)
list_features = c('Anni.madre','N.gravidanze','Gestazione','Peso','Lunghezza','Cranio')
# Set up the plotting area
par(mfrow = c(1, 2))
for (i in list_features){
boxplot(data[[i]], main = i)
#label generation
x_label <- gsub("\\.", " ", tolower(i))
if (i == 'Gestazione'){
x_label = 'mesi di gestazione'
}
if (i == 'N.gravidanze' || i == 'Gestazione' || i == 'Anni.madre'){
hist(data[[i]], main = i, xlab = x_label)
} else{
plot(density(data[[i]]), main = i)
}
}
Looking at the boxplot showing the age of the mother at birth, it seems clear that a couple of value cannot be realistic (age lower than 10 yo). I’ll proceed analyzing these records and later on remove these.
incorrect_values <- data[data$Anni.madre < 10,]
# Print as a formatted table
kable(incorrect_values, caption = "Records with 'Anni.madre' < 10") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1152 | 1 | 1 | 0 | 41 | 3250 | 490 | 350 | Nat | osp2 | F |
| 1380 | 0 | 0 | 0 | 39 | 3060 | 490 | 330 | Nat | osp3 | M |
data <- data[!(data$Anni.madre < 10 ), ]
Even though the other values recorded for these two observations seem correct, (so probably the age was misspelled), I’m going to remove them from the dataset. If Anni.madre was an important variable to determine the weight of the baby, these two records would introduce a huge bias in the estimation of the correct \(\beta\) parameter.
# Create a table and convert it to a data frame
smoking_table <- as.data.frame(table(data$Fumatrici))
# Rename columns for clarity
colnames(smoking_table) <- c("Maternal Smoking", "Frequency")
# Print as a formatted table
kable(smoking_table,
caption = "Distribution of Maternal Smoking",
col.names = c("Maternal Smoking", "Frequency")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Maternal Smoking | Frequency |
|---|---|
| 0 | 2394 |
| 1 | 104 |
boxplot(data$Peso ~ data$Fumatrici,
main = "Boxplot of Birth Weight by Maternal Smoking",
xlab = "Maternal Smoking (0 = Non-Smoker, 1 = Smoker)",
ylab = "Birth Weight (grams)",
col = c("green", "red"),
border = "black")
birth_type_table <- as.data.frame(table(data$Tipo.parto))
colnames(birth_type_table) <- c("Birth Type", "Frequency")
kable(birth_type_table,
caption = "Distribution of Birth Type",
col.names = c("Birth Type", "Frequency")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Birth Type | Frequency |
|---|---|
| Ces | 728 |
| Nat | 1770 |
boxplot(data$Peso ~ data$Tipo.parto,
main = "Boxplot of Birth Weight by Delivery Type",
xlab = "Delivery Type (Ces = Cesarean, Nat = Natural)",
ylab = "Birth Weight (grams)",
col = c("lightblue", "lightpink"),
border = "black")
# Table for Hospital (Ospedale)
hospital_table <- as.data.frame(table(data$Ospedale))
colnames(hospital_table) <- c("Hospital", "Frequency")
kable(hospital_table, caption = "Distribution of Births by Hospital") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Hospital | Frequency |
|---|---|
| osp1 | 816 |
| osp2 | 848 |
| osp3 | 834 |
# Boxplot for Hospital
boxplot(data$Peso ~ data$Ospedale,
main = "Birth Weight by Hospital",
xlab = "Hospital",
ylab = "Birth Weight (grams)",
col = c("lightblue", "lightgreen", "pink"), # Add colors
border = "black")
sex_table <- as.data.frame(table(data$Sesso))
colnames(sex_table) <- c("Sex", "Frequency")
kable(sex_table, caption = "Distribution of Births by Sex") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Sex | Frequency |
|---|---|
| F | 1255 |
| M | 1243 |
# Boxplot for Sex
boxplot(data$Peso ~ data$Sesso,
main = "Birth Weight by Sex",
xlab = "Sex (F = Female, M = Male)",
ylab = "Birth Weight (grams)",
col = c("lightblue", "lightpink"),
border = "black")
Looking at boxplot printed it seems that “Fumatrici” and “Sex” could be useful features to predict the weight of the baby, since the boxplots have a slightly different shape considering these distinctions.
Now in the image below we can see the graphical representation of one variable of the dataset as a function of another to understand any relation. In addition, we also have a measure of the pairwise correlations among the variables. This visualization will be useful when creating the regression model, because it gives us information on which variables should be considered in the model.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(data[,c("Anni.madre","N.gravidanze","Gestazione","Peso","Lunghezza","Cranio")], upper.panel = panel.smooth, lower.panel = panel.cor)
Since we are considering categorical features, assuming the usual hypotesis of random sampling and uncorrelation and independence in the elements of the sample we can use a \(X^2\) test
# Relative frequency of cesarean births by hospital
birth_table <- table(data$Ospedale, data$Tipo.parto)
relative_frequencies <- prop.table(birth_table, margin = 1)[, "Ces"]
df_rf <- data.frame(
Categoria = names(relative_frequencies),
Frequenza = as.numeric(relative_frequencies)
)
# Create table with kable
kable(df_rf,
col.names = c("Hospital", "Relative Frequency"),
digits = 3, # numero di cifre decimali
caption = "Relative Frequency of Hospital Births Considering only Cesarean births") %>%
kable_styling(full_width = FALSE)
| Hospital | Relative Frequency |
|---|---|
| osp1 | 0.297 |
| osp2 | 0.300 |
| osp3 | 0.278 |
# Temporary table
tab <- table(data$Ospedale, data$Tipo.parto)
# Test del chi-quadrato
chi_test <- chisq.test(tab)
chi_test
##
## Pearson's Chi-squared test
##
## data: tab
## X-squared = 1.083, df = 2, p-value = 0.5819
Since we have a p-value of 0.58 we can firmly affirm that the number of Cesearan births is not significantly different from one hospital to the other, we cannot reject \(H_0\). We had also an hint of this looking at the relative frequency of ceasarean birth in each hospital, they are very close one to each other.
Looking at data available online it seems that the medium weight at birth is about 3300 g, while the length is 500mm
Since we don’t know the real distribution of these features (hence neither the real variance) we can proceed using a t_test.
mu_weight = 3300
mu_length = 500
# T Test for weight
peso_test <- t.test(data$Peso, mu = mu_weight)
peso_test
##
## One Sample t-test
##
## data: data$Peso
## t = -1.505, df = 2497, p-value = 0.1324
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
## 3263.577 3304.791
## sample estimates:
## mean of x
## 3284.184
# T Test for length
lunghezza_test <- t.test(data$Lunghezza, mu = mu_length)
lunghezza_test
##
## One Sample t-test
##
## data: data$Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 493.6628 495.7287
## sample estimates:
## mean of x
## 494.6958
Looking at these results we can conclude that the weights of the babies in this sample is not significantly different from the considered true mean (if we use a classic 5 % significance level), while a different consideration has to be done for the length, since the p-value is close to 0 we can be sure that the mean length is significantly different from 500 mm
Also in this case I proceed with a t-test
# Test t for weight
peso_sesso_test <- t.test(Peso ~ Sesso, data = data)
peso_sesso_test
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.115, df = 2488.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.4841 -207.3844
## sample estimates:
## mean in group F mean in group M
## 3161.061 3408.496
# Test t for length
lunghezza_sesso_test <- t.test(Lunghezza ~ Sesso, data = data)
lunghezza_sesso_test
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.939001 -7.882672
## sample estimates:
## mean in group F mean in group M
## 489.7641 499.6750
In both tests we reject \(H_0\), so this means that the newborn is significantly different if we consider a male or a female (hence the sex variable will be almost certainly included in the regression model)
I start checking normality in the target variable.
# Compute skewness and round to two decimal places
skewness_value <- round(moments::skewness(data$Peso), 2)
cat("Skewness:", skewness_value, "\n")
## Skewness: -0.65
# Compute excess kurtosis and round to two decimal places
kurtosis_value <- round(moments::kurtosis(data$Peso) - 3, 2)
cat("Excess Kurtosis:", kurtosis_value, "\n")
## Excess Kurtosis: 2.03
shapiro.test(data$Peso)
##
## Shapiro-Wilk normality test
##
## data: data$Peso
## W = 0.97068, p-value < 2.2e-16
As we would expect we have to refuse normality in the target variable (although the distribution has many low values, the skewness is negative overall, and the distribution is a bit peaky, giving positive kurtosis). Therefore we won’t expect normal residuals, when we’ll proceed with residuals validation.
Giving the considerations done in the previous steps, I would expect that the relevant variables to describe the weight are
(the last three variables are a bit correlated one to each other, so I’ll have to check for multicollinearity and if it is valuable to use them all).
# let's start with a model with everything
mod1 = lm(Peso~., data= data)
summary(mod1)
##
## Call:
## lm(formula = Peso ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1123.26 -181.53 -14.45 161.05 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.7960 141.4790 -47.610 < 2e-16 ***
## Anni.madre 0.8018 1.1467 0.699 0.4845
## N.gravidanze 11.3812 4.6686 2.438 0.0148 *
## Fumatrici -30.2741 27.5492 -1.099 0.2719
## Gestazione 32.5773 3.8208 8.526 < 2e-16 ***
## Lunghezza 10.2922 0.3009 34.207 < 2e-16 ***
## Cranio 10.4722 0.4263 24.567 < 2e-16 ***
## Tipo.partoNat 29.6335 12.0905 2.451 0.0143 *
## Ospedaleosp2 -11.0912 13.4471 -0.825 0.4096
## Ospedaleosp3 28.2495 13.5054 2.092 0.0366 *
## SessoM 77.5723 11.1865 6.934 5.18e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274 on 2487 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 668.7 on 10 and 2487 DF, p-value: < 2.2e-16
As we would expect giving to the model all the variables we obtain a bit a mess.
Let’s try again removing Anni.madre and Ospedale which for sure are not a relevant factors (actually Ospedale 3 seems to differ a bit from the other two, but anyway the p value is not so low)
# removing anni.madre and Ospedale
mod2 = update(mod1,~.-Anni.madre-Ospedale)
summary(mod2)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1129.88 -181.36 -16.24 160.63 2634.62
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6708.8092 136.0640 -49.306 < 2e-16 ***
## N.gravidanze 12.9927 4.3439 2.991 0.00281 **
## Fumatrici -31.8823 27.5803 -1.156 0.24780
## Gestazione 32.5970 3.8039 8.569 < 2e-16 ***
## Lunghezza 10.2684 0.3011 34.098 < 2e-16 ***
## Cranio 10.5015 0.4262 24.637 < 2e-16 ***
## Tipo.partoNat 30.4244 12.1041 2.514 0.01201 *
## SessoM 78.1031 11.1998 6.974 3.94e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 2490 degrees of freedom
## Multiple R-squared: 0.7278, Adjusted R-squared: 0.7271
## F-statistic: 951.3 on 7 and 2490 DF, p-value: < 2.2e-16
I tried removing other variables before Fumatrici, because I would expect it to be relevant, but it’s not (although the estimate is quite big, there is high variability, hence the standard error is very high giving a p value very high).
Instead N.Gravidanze and the type of birth seems to have effects on the description of the target variable.
# remove fumatrici
mod3 = update(mod2,~.-Fumatrici)
summary(mod3)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1129.14 -181.97 -16.26 160.95 2638.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6708.0171 136.0715 -49.298 < 2e-16 ***
## N.gravidanze 12.7356 4.3385 2.935 0.00336 **
## Gestazione 32.3253 3.7969 8.514 < 2e-16 ***
## Lunghezza 10.2833 0.3009 34.177 < 2e-16 ***
## Cranio 10.5063 0.4263 24.648 < 2e-16 ***
## Tipo.partoNat 30.1601 12.1027 2.492 0.01277 *
## SessoM 77.9171 11.1994 6.957 4.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 2491 degrees of freedom
## Multiple R-squared: 0.7277, Adjusted R-squared: 0.727
## F-statistic: 1109 on 6 and 2491 DF, p-value: < 2.2e-16
Here we could already be somewhat happy with the results obtained, the adjusted r squared is still high 0.73, and very close to the one obtained from the model with all the regressors. But since we look for semplicity and a parsimonious model, I’ll try in the next steps at removing other variables and see what happens.
# remove Tipo.parto and n.gravidanze
mod4 = update(mod3,~.-Tipo.parto-N.gravidanze)
summary(mod4)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.1 -184.4 -17.4 163.3 2626.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.6732 135.5952 -49.055 < 2e-16 ***
## Gestazione 31.3262 3.7884 8.269 < 2e-16 ***
## Lunghezza 10.2024 0.3009 33.909 < 2e-16 ***
## Cranio 10.6706 0.4247 25.126 < 2e-16 ***
## SessoM 79.1027 11.2205 7.050 2.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1652 on 4 and 2493 DF, p-value: < 2.2e-16
Removing Tipo.parto and n.gravidanze almost doesn’t affect the results (R-squared still 0.73). So I’ll go on excluding them from the model.
Now I would like to check whether we have multicollinearity in the remaining variables.
vif_values <- car::vif(mod4)
# Convert VIF results to a data frame (fixing the structure)
vif_table <- data.frame(
Variable = rownames(as.data.frame(vif_values)),
VIF = round(as.numeric(vif_values), 2)
)
# Print formatted table
kable(vif_table,
caption = "Variance Inflation Factor (VIF) for Model Variables",
col.names = c("Variable", "VIF")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Variable | VIF |
|---|---|
| Gestazione | 1.65 |
| Lunghezza | 2.07 |
| Cranio | 1.61 |
| Sesso | 1.04 |
We can see a moderate correlation among these values, but the vif values are still far under 5. Let’s try anyway to remove one of these variables and see how the model changes. We could try removing Cranio variable.
mod5 = update(mod4,~.-Cranio)
summary(mod5)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.2 -193.1 -22.9 186.8 3618.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5223.6509 137.7911 -37.910 < 2e-16 ***
## Gestazione 44.4964 4.1994 10.596 < 2e-16 ***
## Lunghezza 13.6010 0.3008 45.215 < 2e-16 ***
## SessoM 90.4500 12.5484 7.208 7.49e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 307.9 on 2494 degrees of freedom
## Multiple R-squared: 0.6568, Adjusted R-squared: 0.6563
## F-statistic: 1591 on 3 and 2494 DF, p-value: < 2.2e-16
Here the R-squared drops significantly, therefore we should keep this feature in the model
Let’s try to see if we have interactions among some variables (i can’t think at any possible interaction that makes sense a part from Sesso combined with other features, let’s try with gestazione)
# i start from model 4 since it seemed the best so far
mod6 = update(mod4,~.+Sesso:Gestazione)
summary(mod6)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## Gestazione:Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1133.40 -183.31 -17.02 162.34 2625.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6540.5979 166.9409 -39.179 < 2e-16 ***
## Gestazione 28.3984 4.5760 6.206 6.35e-10 ***
## Lunghezza 10.2066 0.3009 33.922 < 2e-16 ***
## Cranio 10.6715 0.4247 25.130 < 2e-16 ***
## SessoM -188.8609 235.2241 -0.803 0.422
## Gestazione:SessoM 6.8666 6.0208 1.140 0.254
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2492 degrees of freedom
## Multiple R-squared: 0.7263, Adjusted R-squared: 0.7257
## F-statistic: 1322 on 5 and 2492 DF, p-value: < 2.2e-16
This clearly makes the model worse than before.
Let’s check if we have non linear effects: Gestazione seems to have a quadratic effect (especially at low months).
mod7 = update(mod4,~.+I(Gestazione^2))
summary(mod7)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## I(Gestazione^2), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1132.76 -181.60 -15.82 162.79 2649.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4640.6017 899.9572 -5.156 2.71e-07 ***
## Gestazione -80.9464 49.8136 -1.625 0.1043
## Lunghezza 10.3056 0.3041 33.892 < 2e-16 ***
## Cranio 10.7671 0.4265 25.247 < 2e-16 ***
## SessoM 76.9130 11.2530 6.835 1.03e-11 ***
## I(Gestazione^2) 1.4988 0.6631 2.260 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.9 on 2492 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7261
## F-statistic: 1325 on 5 and 2492 DF, p-value: < 2.2e-16
mod8 = update(mod7,~.-Gestazione)
summary(mod8)
##
## Call:
## lm(formula = Peso ~ Lunghezza + Cranio + Sesso + I(Gestazione^2),
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1135.54 -183.52 -17.07 164.34 2630.57
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6089.1448 123.7225 -49.216 < 2e-16 ***
## Lunghezza 10.2139 0.2989 34.173 < 2e-16 ***
## Cranio 10.6907 0.4240 25.213 < 2e-16 ***
## SessoM 78.4604 11.2164 6.995 3.39e-12 ***
## I(Gestazione^2) 0.4244 0.0504 8.421 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2493 degrees of freedom
## Multiple R-squared: 0.7264, Adjusted R-squared: 0.7259
## F-statistic: 1655 on 4 and 2493 DF, p-value: < 2.2e-16
The last model considering quadratic Gestazione has a slighly better adjusted R-squared than the same model with linear Gestazione (mod4).
Now let’s look at the AIC and BIC values of all the models created (I expect model 4 and 8 to be the best).
# Compute AIC and convert it into a data frame
aic_values <- AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8)
aic_table <- data.frame(Model = rownames(aic_values),
DF = aic_values$df,
AIC = round(aic_values$AIC, 2))
# Print formatted AIC table
kable(aic_table,
caption = "Akaike Information Criterion (AIC) for Model Comparison",
col.names = c("Model", "Degrees of Freedom", "AIC")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Model | Degrees of Freedom | AIC |
|---|---|---|
| mod1 | 12 | 35145.57 |
| mod2 | 9 | 35149.33 |
| mod3 | 8 | 35148.67 |
| mod4 | 6 | 35159.12 |
| mod5 | 5 | 35721.00 |
| mod6 | 7 | 35159.82 |
| mod7 | 7 | 35156.01 |
| mod8 | 6 | 35156.65 |
# Compute BIC and convert it into a data frame
bic_values <- BIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7, mod8)
bic_table <- data.frame(Model = rownames(bic_values),
DF = bic_values$df,
BIC = round(bic_values$BIC, 2))
# Print formatted BIC table
kable(bic_table,
caption = "Bayesian Information Criterion (BIC) for Model Comparison",
col.names = c("Model", "Degrees of Freedom", "BIC")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Model | Degrees of Freedom | BIC |
|---|---|---|
| mod1 | 12 | 35215.45 |
| mod2 | 9 | 35201.73 |
| mod3 | 8 | 35195.25 |
| mod4 | 6 | 35194.06 |
| mod5 | 5 | 35750.11 |
| mod6 | 7 | 35200.58 |
| mod7 | 7 | 35196.77 |
| mod8 | 6 | 35191.59 |
AIC tends to prefer a bit more complex models, hence it give the lowest score to the first model. But if look at BIC it is clear that the best is the last one (one interesting things to notice is how worse the model 5 is, so we cannot remove Cranio although is highly correlated with Lunghezza).
Let’s try to check what we get if we use stepAIC function.
n = nrow(data)
stepwise.mod = MASS::stepAIC(mod1,direction="both",k=log(n))
## Start: AIC=28118.61
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Anni.madre 1 36710 186779904 28111
## - Fumatrici 1 90677 186833870 28112
## - Ospedale 2 687555 187430749 28112
## - N.gravidanze 1 446244 187189438 28117
## - Tipo.parto 1 451073 187194266 28117
## <none> 186743194 28119
## - Sesso 1 3610705 190353899 28159
## - Gestazione 1 5458852 192202046 28183
## - Cranio 1 45318506 232061700 28654
## - Lunghezza 1 87861708 274604902 29074
##
## Step: AIC=28111.28
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Fumatrici 1 91599 186871503 28105
## - Ospedale 2 693914 187473818 28105
## - Tipo.parto 1 452049 187231953 28110
## <none> 186779904 28111
## - N.gravidanze 1 631082 187410986 28112
## + Anni.madre 1 36710 186743194 28119
## - Sesso 1 3617809 190397713 28151
## - Gestazione 1 5424800 192204704 28175
## - Cranio 1 45569477 232349381 28649
## - Lunghezza 1 87852027 274631931 29066
##
## Step: AIC=28104.68
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Ospedale 2 702925 187574428 28098
## - Tipo.parto 1 444404 187315907 28103
## <none> 186871503 28105
## - N.gravidanze 1 608136 187479640 28105
## + Fumatrici 1 91599 186779904 28111
## + Anni.madre 1 37633 186833870 28112
## - Sesso 1 3601860 190473363 28145
## - Gestazione 1 5358199 192229702 28168
## - Cranio 1 45613331 232484834 28642
## - Lunghezza 1 88259386 275130889 29063
##
## Step: AIC=28098.41
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso
##
## Df Sum of Sq RSS AIC
## - Tipo.parto 1 467626 188042054 28097
## <none> 187574428 28098
## - N.gravidanze 1 648873 188223301 28099
## + Ospedale 2 702925 186871503 28105
## + Fumatrici 1 100610 187473818 28105
## + Anni.madre 1 44184 187530244 28106
## - Sesso 1 3644818 191219246 28139
## - Gestazione 1 5457887 193032315 28162
## - Cranio 1 45747094 233321522 28636
## - Lunghezza 1 87955701 275530129 29051
##
## Step: AIC=28096.81
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
##
## Df Sum of Sq RSS AIC
## <none> 188042054 28097
## - N.gravidanze 1 621053 188663107 28097
## + Tipo.parto 1 467626 187574428 28098
## + Ospedale 2 726146 187315907 28103
## + Fumatrici 1 92548 187949505 28103
## + Anni.madre 1 45366 187996688 28104
## - Sesso 1 3650790 191692844 28137
## - Gestazione 1 5477493 193519547 28161
## - Cranio 1 46098547 234140601 28637
## - Lunghezza 1 87532691 275574744 29044
summary(stepwise.mod)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.37 -180.98 -15.57 163.69 2639.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.7251 135.8036 -49.201 < 2e-16 ***
## N.gravidanze 12.4554 4.3416 2.869 0.00415 **
## Gestazione 32.3827 3.8008 8.520 < 2e-16 ***
## Lunghezza 10.2455 0.3008 34.059 < 2e-16 ***
## Cranio 10.5410 0.4265 24.717 < 2e-16 ***
## SessoM 77.9807 11.2111 6.956 4.47e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2492 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7265
## F-statistic: 1327 on 5 and 2492 DF, p-value: < 2.2e-16
The automatic selection of the model includes also the N.gravidanze feature, which I had removed in my process of selection of the model. Let’s try modifying model with quadratic Gestazione adding N.gravidanze and see if we could improve BIC value.
mod9 = update(mod8,~.+N.gravidanze)
summary(mod9)
##
## Call:
## lm(formula = Peso ~ Lunghezza + Cranio + Sesso + I(Gestazione^2) +
## N.gravidanze, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1146.79 -181.16 -14.75 165.39 2643.24
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.100e+03 1.236e+02 -49.355 < 2e-16 ***
## Lunghezza 1.026e+01 2.988e-01 34.326 < 2e-16 ***
## Cranio 1.056e+01 4.258e-01 24.805 < 2e-16 ***
## SessoM 7.731e+01 1.121e+01 6.898 6.64e-12 ***
## I(Gestazione^2) 4.386e-01 5.057e-02 8.674 < 2e-16 ***
## N.gravidanze 1.253e+01 4.340e+00 2.889 0.0039 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.6 on 2492 degrees of freedom
## Multiple R-squared: 0.7273, Adjusted R-squared: 0.7268
## F-statistic: 1329 on 5 and 2492 DF, p-value: < 2.2e-16
bic_values <- BIC(stepwise.mod, mod8, mod9)
# Convert BIC results to a data frame
bic_table <- data.frame(
Model = rownames(bic_values),
DF = bic_values$df,
BIC = round(bic_values$BIC, 2) # Round to 2 decimal places
)
# Print formatted BIC table
kable(bic_table,
caption = "Bayesian Information Criterion (BIC) for Model Comparison",
col.names = c("Model", "Degrees of Freedom", "BIC")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Model | Degrees of Freedom | BIC |
|---|---|---|
| stepwise.mod | 7 | 35193.65 |
| mod8 | 6 | 35191.59 |
| mod9 | 7 | 35191.06 |
The output given from the stepwise mod function is not as good as model 8 or 9. Model 9 has a slightly better BIC the model 8 (also an adjusted R-squared which is a bit better), hence I will conclude that this model 9 is the best among those tried.
The final considered model is \[ \hat{Y}_{\text{peso}} = -6100 + 10.26 \times \text{Lunghezza} + 10.56 \times \text{Cranio} + 77.31 \times \text{SessoM} + 0.4386 \times (\text{Gestazione})^2 + 12.53 \times \text{N.gravidanze}. \] Looking at the \(R^2\) value (0.73) we can say that the model is able to “explain” 73% of the variability of the model, which is quite a good value in a real case.
Now it’s time to make an analysis on the residuals of the model.
par(mfrow=c(1,2))
plot(residuals(mod9))
abline(h=mean(residuals(mod9)),col=2)
plot(density(residuals(mod9)))
bptest(mod9) # no constant variance
##
## studentized Breusch-Pagan test
##
## data: mod9
## BP = 90.909, df = 5, p-value < 2.2e-16
dwtest(mod9) # no autocorrelation
##
## Durbin-Watson test
##
## data: mod9
## DW = 1.9526, p-value = 0.1178
## alternative hypothesis: true autocorrelation is greater than 0
shapiro.test(residuals(mod9)) # here we refuse normality
##
## Shapiro-Wilk normality test
##
## data: residuals(mod9)
## W = 0.97413, p-value < 2.2e-16
The residuals seems to be evenly spread along the zero line, so this is a good sign. Nevertheless some values seem to be far away from the 0 value, in particular there is a record with a huge error value, I will inspect it later.
Breusch-Pagan test gives back a p-value close to 0, hence we must refuse homoskedasticity (probabily this is due to some outliers). Durbin-Watson instead gives a p-value of 0.12, which indicates we cannot refuse the hypotesis of independece among the residuals. There is no autocorrelation among the residuals. Shapiro test has a p-value close to zero indicating that the residuals don’t follow a normal distribution, but we could expect this fact considering that the target variable PESO is not normal either.
Now it’s time to inspec leverages and outliers:
## leverage
lev = hatvalues(mod9)
plot(lev)
p=sum(lev)
#threshold which indicates leverage values
threshold = 2*p/n
abline(h=threshold,col=2)
# outliers
plot(rstudent(mod9))
abline(h=c(-2,2),col=2)
cook = cooks.distance(mod9)
plot(cook)
max_cook <- round(max(cook), 3)
cat("Maximum Cook's Distance:", max_cook, "\n")
## Maximum Cook's Distance: 0.825
From the first plot we can notice many leverage values, there are also quite a lot of outliers in the residuals, but the major inconvenient comes out looking at cook distance plot, there is a value equal to 0.825.
Let’s check in detail this record:
number_influent_record <- which(unname(cook) > 0.5)
influent_record = data[number_influent_record,]
# Print formatted influential record table
kable(influent_record,
caption = "Details of Influential Records (Cook's Distance > 0.5)") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Anni.madre | N.gravidanze | Fumatrici | Gestazione | Peso | Lunghezza | Cranio | Tipo.parto | Ospedale | Sesso | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1551 | 35 | 1 | 0 | 38 | 4370 | 315 | 374 | Nat | osp3 | F |
Looking at this record we can clearly notice that there is an issue with the Lunghezza field. We have quite an heavy baby, but a length of the body which doesn’t match. I don’t have enough information to understand whether this is just an error while recording the data, or if it is due to some strange illness (maybe some form of Nanism). What we know is that for sure it is influent and has a great impact on our model.
Overall the metrics associated with the residuals don’t provide very good results. My opinion is that the model in most cases can capture quite well the weight of a baby, but there are some particular cases (especially the one analyzed before) in which the predicted weight is different from the real one, and this could be due to other factors which are not in the dataset of this project. There could be strange pathologies that can affect a lot PESO feature and hence provide a huge variability in the results, this can be also the reason why the residuals don’t have constant variance.
Let’s now make some prediction using the model described in the previous section. Let’s estimate the weight of a baby girl born from a mother at the third pregnancy at the 39th week.
My model consider also LUNGHEZZA and CRANIO as regressors, hence for these values I will consider the mean.
mean_lunghezza <- mean(data$Lunghezza, na.rm = TRUE)
mean_cranio <- mean(data$Cranio, na.rm = TRUE)
gender = 'F'
n_pregnancy = 3
weeks = 39
# Create a data frame with the required predictors
new_data <- data.frame(
Lunghezza = mean_lunghezza,
Cranio = mean_cranio,
Sesso = factor(gender),
Gestazione = weeks,
N.gravidanze = n_pregnancy
)
predicted_value <- predict(mod9, newdata = new_data)
# Convert to data frame and round the value
pred_table = data.frame(Predicted_Weight = round(predicted_value, 2))
kable(pred_table,
caption = "Predicted Birth Weight Based on Model 9",
col.names = c("Predicted Weight (grams)")) %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| Predicted Weight (grams) |
|---|
| 3270.18 |
As first visualization i would like to have an idea of the difference between observed values and those estimated by the model
data$predicted <- predict(mod9)
data$residuals <- residuals(mod9)
ggplot(data, aes(x = predicted, y = Peso)) +
geom_point(alpha = 0.4) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(x = "Predicted Weight", y = "Observed Weight",
title = "Comparison of Observed and Predicted Weight") +
theme_minimal()
From the plot we can notice how the model perform quite well for weight around the medium - high weights, but tends to underestimate the weight when the observed value is low. We can say that in case of low weights the model is not reliable.
Now, for example, we can check how the Gestazione feature affect the weight.
min(Gestazione)
## [1] 25
max(Gestazione)
## [1] 43
# Restrict gestation weeks from 25 to 43 in steps of 1
eff_gest <- ggpredict(mod9, terms = "Gestazione [25:43 by=1]")
ggplot(eff_gest, aes(x = x, y = predicted)) +
geom_line(color = "blue", linewidth = 1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
fill = "blue", alpha = 0.2) +
labs(x = "Gestation Weeks",
y = "Predicted Birth Weight (grams)",
title = "Effect of Gestation Weeks on Predicted Birth Weight") +
theme_minimal()
As we would expect the weight increases wit the number of the gestation weeks, and we can also notice that we have a greater variability in low and high values, hence (as we could see also in the previous plot) the “extreme” values are very difficult to estimate. This is due to low and high values which are less represented in the dataset and also because in these areas the trend doesn’t appear to be linear.
Now we can see again the boxplot representing how the mother smoking feature affects the birth weight.
ggplot(data, aes(x = factor(Fumatrici), y = Peso, fill = factor(Fumatrici))) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("skyblue", "tomato"),
labels = c("Non-smoker", "Smoker")) +
labs(x = "Mother Smoking Status",
y = "Birth Weight (grams)",
title = "Birth Weight Distribution by Mother Smoking Status") +
theme_minimal()
We can see that it seems that a smoker mother will give birth to slightly lighter baby, but as we have already seen in the previous sections this feature is not statistically relevant.
To conclude, I can say that given the dataset and information provided we are not able to build a 100% satisfactory regression model. The model itself is quite good in the prediction, but has some difficulties with extreme values. These extreme values, not well explained by the model, causes also many problems in the residuals analysis, giving many leverages values, outliers and causing heteroskedasticity. We could try using more complex models then the simple linear regression, or maybe the best solution would be to restrict the use of this model within certain ranges.