suppressPackageStartupMessages({
library(zoo)
library(dplyr)
library(car)
})
library(sandwich)
library(lmtest)
library(ggplot2)
library(knitr)
#Upload database
dati<-read.csv("neonati.csv")
attach(dati)
print("summary dataset")
## [1] "summary dataset"
summary(dati)
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. : 0.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.16 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## Peso Lunghezza Cranio Tipo.parto
## Min. : 830 Min. :310.0 Min. :235 Length:2500
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Class :character
## Median :3300 Median :500.0 Median :340 Mode :character
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
## Ospedale Sesso
## Length:2500 Length:2500
## Class :character Class :character
## Mode :character Mode :character
##
##
##
print("numbers of anomaly Anni.madre's variables less than 10 years: ")
## [1] "numbers of anomaly Anni.madre's variables less than 10 years: "
sum(dati$Anni.madre < 10)
## [1] 2
dati$Anni.madre[dati$Anni.madre < 10] <- NA
summary(dati)
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. :13.00 Min. : 0.0000 Min. :0.0000 Min. :25.00
## 1st Qu.:25.00 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:38.00
## Median :28.00 Median : 1.0000 Median :0.0000 Median :39.00
## Mean :28.19 Mean : 0.9812 Mean :0.0416 Mean :38.98
## 3rd Qu.:32.00 3rd Qu.: 1.0000 3rd Qu.:0.0000 3rd Qu.:40.00
## Max. :46.00 Max. :12.0000 Max. :1.0000 Max. :43.00
## NA's :2
## Peso Lunghezza Cranio Tipo.parto
## Min. : 830 Min. :310.0 Min. :235 Length:2500
## 1st Qu.:2990 1st Qu.:480.0 1st Qu.:330 Class :character
## Median :3300 Median :500.0 Median :340 Mode :character
## Mean :3284 Mean :494.7 Mean :340
## 3rd Qu.:3620 3rd Qu.:510.0 3rd Qu.:350
## Max. :4930 Max. :565.0 Max. :390
##
## Ospedale Sesso
## Length:2500 Length:2500
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
skew <- moments::skewness(Peso)
kur <- moments::kurtosis(Peso)-3
df <- data.frame(Skewness = skew, Kurtosis = kur)
kable(df)
| Skewness | Kurtosis |
|---|---|
| -0.6470308 | 2.031532 |
shapiro.test(Peso )
##
## Shapiro-Wilk normality test
##
## data: Peso
## W = 0.97066, p-value < 2.2e-16
The results of the analysis of the variable output peso are as follows:
Negatively skewed distribution: higher frequency of values below the average compared to a symmetrical distribution
Leptokurtic distribution: dataset may contain more outliers or extreme values than a normal distribution.
The data do not follow a normal distribution.
Anni.madre: Continuous quantitative variable on a ratio scale, indicating the mother’s age in years.
N.gravidanze: Discrete quantitative variable representing the number of previous pregnancies of the mother.
Fumatrici: Binary qualitative variable (0/1) indicating whether the mother is a smoker (1) or not (0). It is a nominal categorical variable with only two categories.
Gestazione: Continuous quantitative variable on a ratio scale, measures the duration of pregnancy in weeks, crucial for assessing the health of the newborn and prematurity.
Peso: Continuous quantitative variable on a ratio scale, represents the weight of the newborn at birth, an important indicator of neonatal health.
Lunghezza: Continuous quantitative variable, indicates the length of the newborn in centimetres, another parameter of physical development at birth.
Canio: Continuous quantitative variable, measures the newborn’s head circumference, an anthropometric indicator.
Tipo.parto: Nominal qualitative variable, represents the category of delivery type (e.g. natural, caesarean).
Ospedale: Nominal qualitative variable, indicates the hospital where the birth took place, a possible source of variability in the data.
Sex: Nominal qualitative variable, specifies the sex of the newborn (male/female), used as a control variable.
During the initial exploratory analysis of the dataset, two outliers were identified in the ‘Mother’s age’ variable, both less than 10, which is an implausible value. To ensure data quality, these values were replaced with NA. It is chosen to replace outliers with NA instead of deleting them, in order to maintain the integrity of the dataset and allow transparent management of missing data during subsequent analyses.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr = usr))
plot.window(xlim = c(0,1), ylim = c(0,1))
r <- cor(x, y, use = "complete.obs")
txt <- format(c(r, 1), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor))
cex.cor <- 0.8 / strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor)
}
pairs(dati[sapply(dati, is.numeric)], lower.panel=panel.cor, upper.panel=panel.smooth)
Moderate positive correlation between “Gestazione” e “Peso”. A scatterplot that is similar to a logarithmic function. This suggests that weight gain may be faster in the early stages (weeks of gestation) and then slow down.
par(mfrow=c(1,2))
boxplot(Peso)
boxplot(Peso~Sesso)
boxplot(Peso~Fumatrici)
The presence of numerous outliers could indicate significant weight variability within the groups.
t.test(Peso ~ Fumatrici, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -45.61354 145.22674
## sample estimates:
## mean in group 0 mean in group 1
## 3286.153 3236.346
There is a tendency towards a difference in the average weight of newborns between children of mothers who smoke and those who do not, with average values of 3236 g and 3286 g respectively. However, this difference was not statistically significant (t = 1.034, p = 0.3044, t-test), indicating that with the sample available, it is not possible to confirm a real difference. The t-test was considered appropriate given the large sample size, despite the non-normality of the Weight variable.
The test involving the other control variable (Sesso) is shown in Chapter 2.1.1 below.
par(mfrow=c(1,3))
boxplot(Peso~Sesso)
boxplot(Lunghezza~Sesso)
boxplot(Cranio~Sesso)
t.test(Peso ~ Sesso, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M
## 3161.132 3408.215
t.test(Lunghezza ~ Sesso, data = dati)
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.929470 -7.876273
## sample estimates:
## mean in group F mean in group M
## 489.7643 499.6672
t.test(Cranio ~ Sesso, data = dati)
##
## Welch Two Sample t-test
##
## data: Cranio by Sesso
## t = -7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -6.089912 -3.541270
## sample estimates:
## mean in group F mean in group M
## 337.6330 342.4486
The difference in the weight, length and dimensions of the skulls of males and females is statistically significant.
H0: the proportion of caesarean sections is the same in all hospitals
tabella <- table(Ospedale, Tipo.parto)
print(tabella)
## Tipo.parto
## Ospedale Ces Nat
## osp1 242 574
## osp2 254 595
## osp3 232 603
test_chi <- chisq.test(tabella)
print(test_chi)
##
## Pearson's Chi-squared test
##
## data: tabella
## X-squared = 1.0972, df = 2, p-value = 0.5778
There is insufficient statistical evidence to claim that the proportion of caesarean sections differs between the three hospitals in the sample.
Mean weight population newborns from OMS:
mu_m_weight <- 3346.4 #g
mu_f_weight <- 3232.2 #g
mu_m_length <- 498.84 #millimeters
mu_f_length <- 491.48 #millimeters
dati_m <- subset(dati, Sesso == "M")
dati_f <- subset(dati, Sesso == "F")
weight_male_newborns_sample <- mean(dati_m$Peso)
weight_female_newborns_sample <- mean(dati_f$Peso)
Normal test for variable ‘peso’ divided by variable ‘sesso’.
shapiro.test(dati_m$Peso)
##
## Shapiro-Wilk normality test
##
## data: dati_m$Peso
## W = 0.96647, p-value = 2.321e-16
shapiro.test(dati_f$Peso)
##
## Shapiro-Wilk normality test
##
## data: dati_f$Peso
## W = 0.96285, p-value < 2.2e-16
The data do not follow a normal distribution. However, due to the central limit theorem, they can be used for a t-test.
H0: the median weight of males in the sample is equal to 3346.4 g
t.test(x = dati_m$Peso, mu = mu_m_weight)
##
## One Sample t-test
##
## data: dati_m$Peso
## t = 4.4152, df = 1243, p-value = 1.097e-05
## alternative hypothesis: true mean is not equal to 3346.4
## 95 percent confidence interval:
## 3380.748 3435.683
## sample estimates:
## mean of x
## 3408.215
p_value>0.05: I can’t refuse H0.
H0: the median weight of females in the sample is equal to 3232.2 g
t.test(x = dati_f$Peso, mu = mu_f_weight)
##
## One Sample t-test
##
## data: dati_f$Peso
## t = -4.7855, df = 1255, p-value = 1.908e-06
## alternative hypothesis: true mean is not equal to 3232.2
## 95 percent confidence interval:
## 3131.997 3190.267
## sample estimates:
## mean of x
## 3161.132
p_value>0.05: I can’t refuse H0.
H0: the median length of males in the sample is equal to 498.84 mm
t.test(x = dati_f$Lunghezza, mu = mu_m_length)
##
## One Sample t-test
##
## data: dati_f$Lunghezza
## t = -11.682, df = 1255, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 498.84
## 95 percent confidence interval:
## 488.2401 491.2885
## sample estimates:
## mean of x
## 489.7643
p_value>0.05: I can’t refuse H0.
H0: the median weight of females in the sample is equal to 491.48 mm
t.test(x = dati_f$Lunghezza, mu = mu_f_length)
##
## One Sample t-test
##
## data: dati_f$Lunghezza
## t = -2.2083, df = 1255, p-value = 0.0274
## alternative hypothesis: true mean is not equal to 491.48
## 95 percent confidence interval:
## 488.2401 491.2885
## sample estimates:
## mean of x
## 489.7643
p_value>0.05: I can’t refuse H0.
In light of the results obtained: the average weight and length of this sample of newborns are significantly equal to those of the population.
Due to the presence of two NA values, a new dataset has been created that contains only complete observations of the variables to be used in the model. According to the literature and clinical considerations, both Tipo.parto and Ospedale are unlikely to have a significant influence on weight and may confound the model. It is important to apply the BIC test to evaluate the best model.
dati_complete <- dati[complete.cases(dati[, c("N.gravidanze", "Fumatrici", "Gestazione", "Lunghezza", "Cranio", "Sesso", "Anni.madre")]), ]
mod<-lm(Peso ~ . - Tipo.parto - Ospedale, data=dati_complete)
summary(mod)
##
## Call:
## lm(formula = Peso ~ . - Tipo.parto - Ospedale, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.6 -181.3 -15.7 163.6 2630.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6712.2405 141.3339 -47.492 < 2e-16 ***
## Anni.madre 0.8803 1.1491 0.766 0.444
## N.gravidanze 11.3789 4.6767 2.433 0.015 *
## Fumatrici -30.3958 27.6080 -1.101 0.271
## Gestazione 32.9472 3.8288 8.605 < 2e-16 ***
## Lunghezza 10.2316 0.3011 33.979 < 2e-16 ***
## Cranio 10.5198 0.4271 24.633 < 2e-16 ***
## SessoM 78.0787 11.2132 6.963 4.24e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7264
## F-statistic: 948.3 on 7 and 2490 DF, p-value: < 2.2e-16
#Test for multicollinearity using VIF.
vif(mod)
## Anni.madre N.gravidanze Fumatrici Gestazione Lunghezza Cranio
## 1.189264 1.187447 1.006692 1.694331 2.079749 1.628987
## Sesso
## 1.040493
No variable shows a VIF factor greater than 5, so there is no risk of multicollinearity.
mod1 <- update(mod, . ~ . - Anni.madre)
summary(mod1)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1150.24 -181.32 -15.73 162.95 2635.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6682.2637 135.7983 -49.207 < 2e-16 ***
## N.gravidanze 12.6996 4.3470 2.921 0.00352 **
## Fumatrici -30.5728 27.6048 -1.108 0.26818
## Gestazione 32.6437 3.8079 8.573 < 2e-16 ***
## Lunghezza 10.2309 0.3011 33.979 < 2e-16 ***
## Cranio 10.5366 0.4265 24.707 < 2e-16 ***
## SessoM 78.1596 11.2117 6.971 4.01e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2491 degrees of freedom
## Multiple R-squared: 0.7271, Adjusted R-squared: 0.7265
## F-statistic: 1106 on 6 and 2491 DF, p-value: < 2.2e-16
After removing the variable ‘Anni.madre’, which was not statistically significant, the model essentially retains the same explanatory power.
BIC(mod,mod1)
## df BIC
## mod 9 35207.48
## mod1 8 35200.24
The comparison between the models using the Bayesian information criterion (BIC) indicates that the model without the variable ‘Anni.madre’ (model mod1) is the best choice. In fact, mod1 has a lower BIC value (35200.24) than the complete model (mod), which includes ‘Anni.madre’ (BIC = 35207.48). This result suggests that excluding ‘Anni.madre’ allows for a more parsimonious model without significantly compromising the quality of the fit to the data. Consequently, model mod1 was selected as the final model for subsequent analyses.
Before moving on to the next step and select the final model, it would be interesting to consider the combination of different values:
Combination of Sesso and Fumatrici: verify whether the effect of maternal smoking varies according to the sex of the newborn
mod2 <- update(mod1, . ~ . +Sesso:Fumatrici)
summary(mod2)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso + Fumatrici:Sesso, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1151.02 -182.19 -16.34 163.02 2636.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6685.1982 135.8916 -49.195 < 2e-16 ***
## N.gravidanze 12.7188 4.3476 2.925 0.00347 **
## Fumatrici -11.9447 40.0956 -0.298 0.76580
## Gestazione 32.6686 3.8086 8.578 < 2e-16 ***
## Lunghezza 10.2326 0.3011 33.980 < 2e-16 ***
## Cranio 10.5378 0.4265 24.706 < 2e-16 ***
## SessoM 79.5898 11.4331 6.961 4.29e-12 ***
## Fumatrici:SessoM -35.3270 55.1420 -0.641 0.52181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7264
## F-statistic: 948.2 on 7 and 2490 DF, p-value: < 2.2e-16
The analysis assessed whether the maternal effect of smoking (‘Fumatici’) on neonatal weight was modified by the sex of the newborn, including an interaction term between the two variables in the model. Despite the plausible clinical hypothesis, the results show that the interaction coefficient is not statistically significant (p = 0.52), indicating no evidence of a differential effect of smoking between males and females. Therefore, it is concluded that the inclusion of the ‘Fumatici:Sesso’ interaction does not make a significant contribution to explaining the variability in neonatal weight in the present dataset.
Combination of Anni.madre and Gestazione: verify the impact of gestation length on newborn weight may vary depending on the mother’s age.
mod3 <- update(mod1, . ~ . + Anni.madre:Gestazione)
summary(mod3)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso + Gestazione:Anni.madre, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1162.30 -181.83 -15.07 163.94 2629.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.687e+03 1.359e+02 -49.201 < 2e-16 ***
## N.gravidanze 1.113e+01 4.676e+00 2.380 0.0174 *
## Fumatrici -3.036e+01 2.761e+01 -1.100 0.2715
## Gestazione 3.223e+01 3.835e+00 8.402 < 2e-16 ***
## Lunghezza 1.023e+01 3.011e-01 33.979 < 2e-16 ***
## Cranio 1.052e+01 4.270e-01 24.626 < 2e-16 ***
## SessoM 7.807e+01 1.121e+01 6.963 4.26e-12 ***
## Gestazione:Anni.madre 2.704e-02 2.957e-02 0.914 0.3607
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7265
## F-statistic: 948.4 on 7 and 2490 DF, p-value: < 2.2e-16
The analysis tested the interaction between gestation length and maternal age to assess whether the effect of gestation on neonatal weight varies according to the mother’s age. The interaction coefficient is not statistically significant (p = 0.36), suggesting that maternal age does not modify the association between gestation and weight. The model with interaction does not show a clear improvement in explanatory power compared to the model without interaction, as also evidenced by the similar R² values.
Combination of Fumatrici and Gestazione verify smoking could affect the duration of pregnancy and therefore the birth weight of the baby.
mod4 <- update(mod1, . ~ . + Fumatrici:Gestazione)
summary(mod4)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso + Fumatrici:Gestazione, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.83 -181.58 -16.95 163.76 2635.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6699.6947 136.7290 -49.000 < 2e-16 ***
## N.gravidanze 12.7452 4.3470 2.932 0.0034 **
## Fumatrici 795.7016 757.5315 1.050 0.2936
## Gestazione 33.2007 3.8418 8.642 < 2e-16 ***
## Lunghezza 10.2252 0.3011 33.957 < 2e-16 ***
## Cranio 10.5313 0.4265 24.693 < 2e-16 ***
## SessoM 78.7436 11.2241 7.016 2.94e-12 ***
## Fumatrici:Gestazione -21.0469 19.2830 -1.091 0.2752
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7273, Adjusted R-squared: 0.7265
## F-statistic: 948.6 on 7 and 2490 DF, p-value: < 2.2e-16
The analysis included an interaction between maternal smoking and gestation length to verify whether the effect of gestation on neonatal weight changes depending on the mother’s smoking status. Although negative, the interaction coefficient is not statistically significant (p = 0.275), indicating that the available data show no evidence of a modifying effect of smoking on the influence of gestation on newborn weight. Consequently, the model with this interaction does not show a substantial improvement over the model without it, as confirmed by the similar value of R². Therefore, the inclusion of this interaction term does not appear justified.
Combination of Fumatrici and Anni.madre: verify whether the impact of smoking on newborn weight varies according to maternal age.
mod5 <- update(mod1, . ~ . + Fumatrici:Anni.madre)
summary(mod5)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso + Fumatrici:Anni.madre, data = dati_complete)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1150.21 -181.76 -15.66 162.96 2636.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6682.5271 135.8277 -49.199 < 2e-16 ***
## N.gravidanze 12.6170 4.3598 2.894 0.00384 **
## Fumatrici -66.7002 144.1633 -0.463 0.64364
## Gestazione 32.6449 3.8087 8.571 < 2e-16 ***
## Lunghezza 10.2335 0.3013 33.961 < 2e-16 ***
## Cranio 10.5336 0.4267 24.685 < 2e-16 ***
## SessoM 78.1696 11.2139 6.971 4.02e-12 ***
## Fumatrici:Anni.madre 1.2768 5.0008 0.255 0.79849
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7264
## F-statistic: 948 on 7 and 2490 DF, p-value: < 2.2e-16
The interaction between “Smokers” and “Mother’s age” has a very small positive coefficient (1.28) and is largely insignificant (p = 0.798), indicating that there is no evidence that the effect of smoking on neonatal weight changes based on the mother’s age in the data analysed. Furthermore, the ‘Smokers’ coefficient alone is not significant, while the other coefficients remain consistent with the previous results.
BIC(mod,mod1,mod2,mod3,mod4,mod5)
## df BIC
## mod 9 35207.48
## mod1 8 35200.24
## mod2 9 35207.65
## mod3 9 35207.23
## mod4 9 35206.87
## mod5 9 35208.00
To select the most suitable model, the Bayesian information criterion (BIC) was used.
Several models were compared, including:
mod1, basic model without interaction terms (8 parameters, BIC = 35200.24);
models with different additional interactions (9 parameters each, BIC between 35206.87 and 35208.00).
The model without interactions (mod1) has the lowest BIC value, indicating a better balance between goodness of fit and model parsimony.
The difference in BIC between mod1 and the other models is greater than 6 points, suggesting substantial evidence in favour of the simpler model.
Furthermore, the added interaction terms are not statistically significant, confirming that their inclusion does not consistently improve the predictive power of the model.
Therefore, it was deemed appropriate to adopt the model without interactions as the final model for the analysis.
par(mfrow=c(2,2))
plot(mod1)
It seems that the homoscedasticity assumption is being respected. Observations are distributed in a random way around the line located at value 0.
Most of the points follow the diagonal line in the Q-Q plot, it means that the model residuals are approximately normal.
Even with the Scale-Location, the model seems to respect the homoscedasticity assumption.
Most of the residuals are randomly distributed around zero, suggesting a good overall fit.
lmtest::bptest(mod1)
##
## studentized Breusch-Pagan test
##
## data: mod1
## BP = 89.841, df = 6, p-value < 2.2e-16
The Breusch-Pagan test reveals significant heteroscedasticity in the model residuals (BP = 89.841, p-value < 2.2e-16), indicating that the variance of the errors is not constant. This could compromise the efficiency of the estimators and the validity of the inferences based on the assumption of homoscedasticity.
lmtest::dwtest(mod1)
##
## Durbin-Watson test
##
## data: mod1
## DW = 1.9539, p-value = 0.1243
## alternative hypothesis: true autocorrelation is greater than 0
The Durbin-Watson test value (DW = 1.9539, p-value = 0.1243) does not detect significant positive autocorrelation in the residuals. Therefore, we can assume that the errors are independent of each other, at least with regard to first-order autocorrelation.
shapiro.test(mod1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod1$residuals
## W = 0.97416, p-value < 2.2e-16
plot(density(residuals(mod1)))
The Shapiro-Wilk test rejects the hypothesis of normality of the residuals (W = 0.97416, p-value < 2.2e-16), suggesting that the distribution of errors deviates from normality. This may affect the reliability of inferences based on asymptotic and parametric properties.
#leverage
lev<-hatvalues(mod1)
plot(lev)
p<-sum(lev)
n<-length(lev)
soglia=2*p/n
abline(h=soglia,col=2)
lev[lev>soglia]
## 13 15 34 67 89 99
## 0.005703347 0.007069519 0.006759391 0.005897170 0.012912032 0.010439560
## 101 105 106 120 128 131
## 0.007549323 0.010620656 0.014509942 0.010039207 0.011385756 0.007243263
## 134 140 151 155 161 182
## 0.007594636 0.011384225 0.010960845 0.007238247 0.020454182 0.011315127
## 194 204 206 220 234 242
## 0.010838669 0.014580285 0.009489217 0.007449878 0.010852452 0.010236626
## 251 279 294 296 306 310
## 0.010893968 0.010509906 0.005976412 0.010171622 0.010858365 0.028816077
## 312 321 335 378 391 413
## 0.013208271 0.010712780 0.010921940 0.015942776 0.010946451 0.010551081
## 424 442 445 473 492 516
## 0.010779018 0.016102594 0.007514853 0.011311323 0.008307115 0.013189756
## 538 557 567 572 582 587
## 0.012114779 0.010675469 0.010349074 0.010613068 0.011722487 0.008415937
## 592 593 638 656 658 668
## 0.006385650 0.010424563 0.006700518 0.006025333 0.011305424 0.011516875
## 684 697 699 703 748 750
## 0.008843220 0.005873668 0.011104540 0.010766247 0.008582403 0.006967238
## 757 758 765 805 828 913
## 0.008194171 0.011589585 0.006081784 0.014371442 0.007252546 0.005646755
## 928 932 946 947 956 984
## 0.022759679 0.010473049 0.006931361 0.008436113 0.007861539 0.010405359
## 985 1014 1017 1026 1037 1051
## 0.007132846 0.008524032 0.011237743 0.011627834 0.010358071 0.010766679
## 1067 1091 1106 1110 1118 1130
## 0.008474975 0.008958973 0.006006367 0.010413856 0.010364104 0.032015721
## 1170 1175 1181 1188 1219 1227
## 0.010797183 0.010504729 0.005679516 0.006486375 0.030880280 0.011904117
## 1238 1248 1262 1271 1273 1282
## 0.005914871 0.014645989 0.012913934 0.010119833 0.007089555 0.010435158
## 1285 1291 1293 1311 1321 1326
## 0.012202207 0.006161061 0.006100990 0.009678935 0.009356101 0.011063731
## 1333 1357 1368 1379 1385 1397
## 0.011374270 0.006967713 0.011083301 0.010729178 0.012642335 0.011244147
## 1398 1400 1410 1411 1415 1425
## 0.010899348 0.005932589 0.012148484 0.008129552 0.010395643 0.010298620
## 1426 1428 1429 1443 1449 1450
## 0.013001527 0.008248631 0.021765511 0.011273327 0.011023383 0.015266009
## 1458 1473 1480 1505 1512 1525
## 0.010509906 0.010705195 0.011506319 0.013430500 0.011239860 0.010430013
## 1537 1551 1553 1556 1576 1583
## 0.011945912 0.048928246 0.008509382 0.005962153 0.010628946 0.012627911
## 1593 1610 1619 1626 1652 1660
## 0.005670938 0.008730820 0.015089828 0.011105265 0.011302561 0.011289996
## 1672 1686 1691 1701 1712 1718
## 0.010904312 0.009356741 0.010807668 0.010860519 0.006999602 0.007041085
## 1720 1727 1761 1763 1780 1781
## 0.010997587 0.013379628 0.011312431 0.010749679 0.025549378 0.016924209
## 1789 1809 1827 1902 1906 1920
## 0.010798211 0.008713628 0.006080039 0.010577043 0.010376884 0.014345163
## 1929 1933 1971 1977 2003 2016
## 0.012565315 0.010997544 0.012329394 0.006935578 0.011147894 0.013534068
## 2040 2046 2049 2086 2089 2101
## 0.011551043 0.014287924 0.010439560 0.013305448 0.015640851 0.011514295
## 2110 2114 2115 2120 2140 2145
## 0.010609140 0.013334510 0.011783317 0.018667294 0.006263686 0.010268870
## 2146 2148 2149 2157 2175 2200
## 0.005836472 0.007986616 0.013612587 0.005970731 0.032600634 0.011679528
## 2202 2216 2220 2221 2224 2237
## 0.010368854 0.008122716 0.013757467 0.021759446 0.005850764 0.010698612
## 2238 2244 2245 2256 2257 2270
## 0.010965495 0.006995602 0.013623314 0.010583299 0.006186646 0.011003034
## 2282 2285 2307 2317 2337 2353
## 0.011000085 0.010708672 0.013986777 0.007753671 0.014207376 0.012975764
## 2359 2361 2408 2412 2422 2437
## 0.010107658 0.010627641 0.009714482 0.010413631 0.021701995 0.023964759
## 2450 2452 2458 2459 2465 2471
## 0.010627964 0.023848473 0.008509147 0.010213499 0.011318339 0.021049712
## 2478
## 0.005777114
#outliers
plot(rstudent(mod1))
abline(h=c(-2,2))
car::outlierTest(mod1)
## rstudent unadjusted p-value Bonferroni p
## 1551 10.033980 2.9696e-23 7.4182e-20
## 155 5.019634 5.5428e-07 1.3846e-03
## 1306 4.820812 1.5158e-06 3.7865e-03
cook<-cooks.distance(mod1)
plot(cook,ylim = c(0,1))
excludi <- c(1551, 155, 1306)
data_filtrato <- dati_complete[-excludi, ]
mod_filtrato <- update(mod1, data = data_filtrato)
summary(mod_filtrato)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Sesso, data = data_filtrato)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1148.85 -180.85 -14.38 163.68 2651.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6710.0504 135.0574 -49.683 < 2e-16 ***
## N.gravidanze 12.5941 4.3235 2.913 0.00361 **
## Fumatrici -29.7504 27.4235 -1.085 0.27809
## Gestazione 33.4626 3.7932 8.822 < 2e-16 ***
## Lunghezza 10.2909 0.3001 34.286 < 2e-16 ***
## Cranio 10.4362 0.4240 24.612 < 2e-16 ***
## SessoM 77.1850 11.1501 6.922 5.63e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 272.9 on 2488 degrees of freedom
## Multiple R-squared: 0.7304, Adjusted R-squared: 0.7297
## F-statistic: 1123 on 6 and 2488 DF, p-value: < 2.2e-16
dati_complete[excludi, ]
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio
## 1553 30 4 0 35 4520 520 360
## 155 30 0 0 36 3610 410 330
## 1307 29 0 0 42 3560 510 355
## Tipo.parto Ospedale Sesso
## 1553 Nat osp2 F
## 155 Nat osp1 M
## 1307 Nat osp1 F
The diagnostic analysis of the regression model was initiated by evaluating the leverage of the observations in order to identify those with a potentially excessive influence on the estimated parameters. Observations with leverage above the reference threshold were considered potentially problematic, as they could distort the model estimate.
Subsequently, the distribution of studentised residuals was examined, highlighting those exceeding the limits of ±2, identified as possible outliers in the response values. The application of a specific statistical test made it possible to identify observations with statistically significant residuals.
Cook’s distance was used to measure the overall influence of each observation on the model, integrating information on leverage and residuals. Observations with high values of this distance were recognised as particularly influential.
To assess the effect of influential observations on the validity of the model, the model was reconstructed excluding them. The comparison between the original model and the one obtained from the filtered dataset showed marginal improvements in quality metrics (standard residual and coefficient of determination R^2), while the estimated coefficients were substantially stable.
These results indicate good model robustness, while recognising the significant role of influential observations. Given the plausibility of the values of the possible outlier observations, it was decided to keep these data in the training model.
new_case <- data.frame(
Gestazione = 39,
Lunghezza = mean(dati_f$Lunghezza),
Cranio = mean(dati_f$Cranio),
Sesso = "F",
Fumatrici = 0,
Tipo.parto = "Nat",
N.gravidanze = 3)
predicted_Peso <- predict(mod1,new_case)
new_case$Predicted_Peso <- predicted_Peso
kable(new_case)
| Gestazione | Lunghezza | Cranio | Sesso | Fumatrici | Tipo.parto | N.gravidanze | Predicted_Peso |
|---|---|---|---|---|---|---|---|
| 39 | 489.7643 | 337.633 | F | 0 | Nat | 3 | 3197.163 |
#dataframe with the varabiles: "Gestazione" in range (37,42) and "Fumatrici" (0,1)
data <- expand.grid(
Gestazione = seq(37, 42, by = 1),
Fumatrici = c(0, 1),
Sesso = c("F", "M")
)
# Calculate the averages to be assigned
media_Lunghezza_F <- mean(dati_f$Lunghezza)
media_Cranio_F <- mean(dati_f$Cranio)
media_Lunghezza_M <- mean(dati_m$Lunghezza)
media_Cranio_M <- mean(dati_m$Cranio)
media_N_gravidanze <- mean(dati_complete$N.gravidanze)
# assign the variables Length and Skull based on Gender using ifelse
data <- data %>%
mutate(
Lunghezza = ifelse(Sesso == "F", media_Lunghezza_F, media_Lunghezza_M),
Cranio = ifelse(Sesso == "F", media_Cranio_F, media_Cranio_M),
N.gravidanze = media_N_gravidanze
)
# Prediction calculation
data$Peso_predetto <- predict(mod1, newdata = data)
# show the table
kable(data)
| Gestazione | Fumatrici | Sesso | Lunghezza | Cranio | N.gravidanze | Peso_predetto |
|---|---|---|---|---|---|---|
| 37 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3106.243 |
| 38 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3138.886 |
| 39 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3171.530 |
| 40 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3204.174 |
| 41 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3236.818 |
| 42 | 0 | F | 489.7643 | 337.6330 | 0.9815853 | 3269.461 |
| 37 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3075.670 |
| 38 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3108.314 |
| 39 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3140.957 |
| 40 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3173.601 |
| 41 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3206.245 |
| 42 | 1 | F | 489.7643 | 337.6330 | 0.9815853 | 3238.889 |
| 37 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3336.457 |
| 38 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3369.101 |
| 39 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3401.745 |
| 40 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3434.388 |
| 41 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3467.032 |
| 42 | 0 | M | 499.6672 | 342.4486 | 0.9815853 | 3499.676 |
| 37 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3305.885 |
| 38 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3338.528 |
| 39 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3371.172 |
| 40 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3403.816 |
| 41 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3436.459 |
| 42 | 1 | M | 499.6672 | 342.4486 | 0.9815853 | 3469.103 |
# Line graph
data$Fumo_descr <- ifelse(data$Fumatrici == 0, "Non fumatrice", "Fumatrice")
data$Gruppo <- interaction(data$Sesso, data$Fumo_descr, sep = "_")
ggplot(data, aes(x = Gestazione, y = Peso_predetto, color = Gruppo)) +
geom_line(linewidth = 1) +
labs(
title = "Impatto combinato di Sesso e Fumo sul Peso Predetto",
x = "Settimane di Gestazione",
y = "Peso Predetto (grammi)",
color = "Gruppo (Sesso_Fumo)"
) +
theme_minimal()
The results show how the Weight variable is influenced by Gender and Smoking. The graphic representation consists of lines showing the predicted weight as a function of weeks of gestation.
It can be seen that the predicted weight for male newborns is always higher than that for females, whether the mother is a smoker or not.
As the variable ‘Settimane di Gestazione’ increases, the weight of newborns increases, whether the mother is a smoker or not.
Ultimately, smoking significantly impacts the weight of newborns. Newborns whose mothers smoke tend to be lighter, regardless of gender.
Dataset and Model Limitations:
The dataset used is limited in size, which may reduce the statistical power of some tests and limit the generalisability of the results.
Preliminary analyses showed that the response variable ‘Peso’ does not follow a normal distribution, exhibiting outliers. The normality and heteroscedasticity of the residuals were also not fully resolved, which may affect the validity of confidence intervals and related statistical tests.
2.3.1 Comments on the best model
The intercept (-6682.26) represents the estimated baseline Peso when all predictor variables are zero; however, this value may not have practical meaning if zero values are outside the observed data range.
For each additional pregnancy (N.gravidanze), the expected Peso increases on average by approximately 12.70 units, holding all other variables constant.
The coefficient for Fumatrici (-30.57) suggests a decrease in Peso for smokers, but this effect is not statistically significant (p = 0.268), so no firm conclusion can be drawn.
Each additional unit of Gestazione is associated with an average increase of about 32.64 units in Peso, controlling for other variables.
Lunghezza has a positive effect, with an average increase of 10.23 units in Peso per unit increase in Lunghezza.
Cranio is also positively associated, with an average Peso increase of 10.54 units for each unit increase in Cranio.
The variable SessoM indicates that males have on average 78.16 units higher Peso compared to females, holding other factors constant.