## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1 26 0 0 42 3380 490 325 Nat
## 2 21 2 0 39 3150 490 345 Nat
## 3 34 3 0 38 3640 500 375 Nat
## 4 28 1 0 41 3690 515 365 Nat
## 5 20 0 0 38 3700 480 335 Nat
## 6 32 0 0 40 3200 495 340 Nat
## Ospedale Sesso
## 1 osp3 M
## 2 osp1 F
## 3 osp2 M
## 4 osp2 M
## 5 osp3 F
## 6 osp2 F
| VarName | VarType | N_Null | Min | Max | Mean | Median | N_Distinct |
|---|---|---|---|---|---|---|---|
| Anni.madre | integer | 0 | 0 | 46 | 28.1640 | 28 | 36 |
| N.gravidanze | integer | 0 | 0 | 12 | 0.9812 | 1 | 13 |
| Fumatrici | integer | 0 | 0 | 1 | 0.0416 | 0 | 2 |
| Gestazione | integer | 0 | 25 | 43 | 38.9804 | 39 | 19 |
| Peso | integer | 0 | 830 | 4930 | 3284.0808 | 3300 | 287 |
| Lunghezza | integer | 0 | 310 | 565 | 494.6920 | 500 | 56 |
| Cranio | integer | 0 | 235 | 390 | 340.0292 | 340 | 112 |
| Tipo.parto | character | 0 | NA | NA | NA | NA | 2 |
| Ospedale | character | 0 | NA | NA | NA | NA | 3 |
| Sesso | character | 0 | NA | NA | NA | NA | 2 |
CONCLUSIONS:
The dataset has no missing values. The Min, Max, Mean, and Median columns are useful reference points when reading the distribution plots. The distinct count helps identify the type of each feature (binary, discrete, continuous) and spot unexpected values: for example, only two distinct values are expected in a field like [Sesso]. The minimum of 0 for [Anni.madre] already points to implausible records.
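A per-column profile like the table above can be built with a few lines of base R. The sketch below is one possible way to do it, not necessarily the code used for this report; the helper name `profile_columns` and the toy data frame are hypothetical.

```r
# Hypothetical helper: one row of summary statistics per column.
profile_columns <- function(df) {
  rows <- lapply(names(df), function(col) {
    x <- df[[col]]
    is_num <- is.numeric(x)
    data.frame(
      VarName    = col,
      VarType    = class(x)[1],
      N_Null     = sum(is.na(x)),
      Min        = if (is_num) as.numeric(min(x, na.rm = TRUE)) else NA_real_,
      Max        = if (is_num) as.numeric(max(x, na.rm = TRUE)) else NA_real_,
      Mean       = if (is_num) as.numeric(mean(x, na.rm = TRUE)) else NA_real_,
      Median     = if (is_num) as.numeric(median(x, na.rm = TRUE)) else NA_real_,
      N_Distinct = length(unique(x)),  # note: NA counts as a distinct value
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data with two of the report's column names; the values are illustrative.
toy <- data.frame(
  Peso  = c(3380L, 3150L, 3640L, 3690L),
  Sesso = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

profile <- profile_columns(toy)
print(profile)
```

Character columns get `NA` for the numeric statistics, which matches the layout of the table above.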
CONCLUSIONS:
The plots help us understand the data distribution, highlight incorrect values, and evaluate the constraints of our analysis. They also provide information about class balance.
## [1] 2449 10
## Anni.madre N.gravidanze Fumatrici Gestazione
## Min. :13.00 Min. :0.0000 Min. :0.00000 Min. :28.00
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:38.00
## Median :28.00 Median :1.0000 Median :0.00000 Median :39.00
## Mean :28.11 Mean :0.9053 Mean :0.04206 Mean :39.07
## 3rd Qu.:32.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:40.00
## Max. :45.00 Max. :5.0000 Max. :1.00000 Max. :43.00
## Peso Lunghezza Cranio Tipo.parto
## Min. :1500 Min. :315 Min. :275.0 Length:2449
## 1st Qu.:3000 1st Qu.:480 1st Qu.:330.0 Class :character
## Median :3300 Median :500 Median :340.0 Mode :character
## Mean :3305 Mean :496 Mean :340.6
## 3rd Qu.:3620 3rd Qu.:510 3rd Qu.:350.0
## Max. :4930 Max. :565 Max. :390.0
## Ospedale Sesso
## Length:2449 Length:2449
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## 'data.frame': 2449 obs. of 10 variables:
## $ Anni.madre : int 26 21 34 28 20 32 26 25 22 23 ...
## $ N.gravidanze: int 0 2 3 1 0 0 1 0 1 0 ...
## $ Fumatrici : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gestazione : int 42 39 38 41 38 40 39 40 40 41 ...
## $ Peso : int 3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
## $ Lunghezza : int 490 490 500 515 480 495 480 510 500 510 ...
## $ Cranio : int 325 345 375 365 335 340 345 349 335 362 ...
## $ Tipo.parto : chr "Nat" "Nat" "Nat" "Nat" ...
## $ Ospedale : chr "osp3" "osp1" "osp2" "osp2" ...
## $ Sesso : chr "M" "F" "M" "M" ...
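The cleaning process was not based on outlier detection. Instead, we filtered the original dataset using a set of assumptions about plausible values.
The resulting data frame [neonati_clean] has 2449 rows and focuses on more representative cases: as the summary above shows, the mothers' ages now range from 13 to 45 and gestation from 28 to 43 weeks.
The significance level (alpha) is set to 0.05 (5%) in all tests.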
H0: The proportion of cesarean deliveries is the same across hospitals.
H1: At least one hospital differs in its cesarean rate.
##
## Pearson's Chi-squared test
##
## data: tab_osp_parto
## X-squared = 0.95076, df = 2, p-value = 0.6216
The p-value (0.6216) is greater than 0.05, so we fail to reject H0. In other words, the available data provide no evidence that the three hospitals differ in their frequency of cesarean deliveries.
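As a minimal sketch of how such a test is run, the block below applies `chisq.test()` to a hospital-by-delivery-type contingency table. The counts and the level name "Ces" are hypothetical, since the report does not show the contents of `tab_osp_parto`.

```r
# Hypothetical counts: rows are hospitals, columns are delivery types.
tab_example <- matrix(
  c(240, 560,
    250, 590,
    235, 574),
  nrow = 3, byrow = TRUE,
  dimnames = list(Ospedale   = c("osp1", "osp2", "osp3"),
                  Tipo.parto = c("Ces", "Nat"))
)

# Pearson chi-squared test of independence between hospital and delivery type.
test_result <- chisq.test(tab_example)

alpha <- 0.05
reject_h0 <- test_result$p.value < alpha

print(test_result)
print(reject_h0)
```

With a 3 x 2 table the test has (3 - 1) * (2 - 1) = 2 degrees of freedom, the same as in the output above.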
H0: The sample mean (Peso or Lunghezza) equals the reference population value (3200 g for weight, 50 cm for length).
H1: The sample mean differs from the reference value.
##
## One Sample t-test
##
## data: neonati_clean$Peso
## t = 10.624, df = 2448, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
## 3285.235 3323.820
## sample estimates:
## mean of x
## 3304.527
##
## One Sample t-test
##
## data: neonati_clean$Lunghezza
## t = 945.76, df = 2448, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 495.0602 496.9096
## sample estimates:
## mean of x
## 495.9849
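In both tests the p-value is below 0.05, so we reject H0: the sample means differ from the hypothesized values. Newborns in the dataset are on average heavier than the 3200 g reference (sample mean 3304.5 g), with possible clinical implications. For length, note that Lunghezza appears to be recorded in millimetres (sample mean 495.98) while the test used 50 as the reference, which explains the very large t statistic. The 95% confidence interval (495.06 to 496.91) also excludes 500, so the conclusion still holds once the reference is expressed in millimetres.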
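The length test can be re-derived from the numbers printed in its output. The sketch below recovers the standard error from the confidence interval and recomputes the t statistic against both 50 and 500; it assumes Lunghezza is in millimetres, so that the 50 cm reference corresponds to 500. The variable names are illustrative.

```r
# Values copied from the one-sample t-test output for Lunghezza.
x_bar <- 495.9849
ci    <- c(495.0602, 496.9096)
df    <- 2448

# Recover the standard error from the half-width of the 95% interval.
t_crit <- qt(0.975, df)
se     <- diff(ci) / 2 / t_crit

t_vs_50  <- (x_bar - 50)  / se   # reference left in cm, data in mm
t_vs_500 <- (x_bar - 500) / se   # reference converted to mm (assumption)
p_vs_500 <- 2 * pt(-abs(t_vs_500), df)

print(round(c(t_vs_50 = t_vs_50, t_vs_500 = t_vs_500), 2))
print(p_vs_500)
```

The first statistic reproduces the reported value of about 945.76; the second is far smaller in magnitude but still significant at the 5% level.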
H0: The mean of Peso (or Lunghezza) is the same for M and F.
H1: The means differ between M and F.
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.188, df = 2446.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -270.3458 -195.4077
## sample estimates:
## mean in group F mean in group M
## 3188.517 3421.393
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.9083, df = 2434.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -10.975379 -7.348861
## sample estimates:
## mean in group F mean in group M
## 491.4207 500.5828
In both tests the p-value is below 0.05, so there is a statistically significant difference in weight and length between males and females: males are on average about 233 g heavier and about 9 mm longer. This is in line with the literature documenting slight anthropometric differences by sex at birth.
##
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione + N.gravidanze +
## Fumatrici + Sesso + Ospedale + Tipo.parto, data = neonati_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1477.66 -276.38 -18.78 260.85 1885.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2643.814 210.268 -12.574 < 2e-16 ***
## Anni.madre 2.177 1.740 1.251 0.21096
## Gestazione 147.279 5.104 28.856 < 2e-16 ***
## N.gravidanze 35.192 8.545 4.118 3.94e-05 ***
## Fumatrici1 -114.589 41.197 -2.781 0.00545 **
## SessoM 167.497 16.645 10.063 < 2e-16 ***
## Ospedaleosp2 2.022 20.239 0.100 0.92044
## Ospedaleosp3 18.354 20.296 0.904 0.36593
## Tipo.partoNat 22.767 18.120 1.256 0.20907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 2440 degrees of freedom
## Multiple R-squared: 0.3004, Adjusted R-squared: 0.2981
## F-statistic: 130.9 on 8 and 2440 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = Peso ~ Anni.madre + Gestazione * Sesso + N.gravidanze +
## Fumatrici + Ospedale + Tipo.parto, data = neonati_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1474.06 -276.47 -19.91 261.13 1887.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2507.180 274.584 -9.131 < 2e-16 ***
## Anni.madre 2.181 1.740 1.254 0.21001
## Gestazione 143.745 6.850 20.986 < 2e-16 ***
## SessoM -139.684 397.314 -0.352 0.72519
## N.gravidanze 35.143 8.546 4.112 4.05e-05 ***
## Fumatrici1 -115.603 41.222 -2.804 0.00508 **
## Ospedaleosp2 2.340 20.245 0.116 0.90798
## Ospedaleosp3 18.990 20.315 0.935 0.34998
## Tipo.partoNat 23.191 18.130 1.279 0.20097
## Gestazione:SessoM 7.859 10.155 0.774 0.43911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 2439 degrees of freedom
## Multiple R-squared: 0.3005, Adjusted R-squared: 0.298
## F-statistic: 116.4 on 9 and 2439 DF, p-value: < 2.2e-16
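Interactions: linear_model_interactions adds the Gestazione * Sesso interaction, testing whether the effect of gestation length on weight varies by sex. The interaction term is not significant (p = 0.439) and the adjusted R-squared stays at about 0.298, so the interaction does not improve the model.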
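To illustrate what the `*` operator does in an R formula, the sketch below fits a main-effects model and an interaction model on simulated data. The column names mirror the report, but the values are made up and the effect sizes are arbitrary assumptions.

```r
set.seed(1)
n <- 200

# Simulated data: column names mirror the report, values are made up.
sim <- data.frame(
  Gestazione = sample(36:42, n, replace = TRUE),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$Peso <- -2500 + 145 * sim$Gestazione + 170 * (sim$Sesso == "M") +
  rnorm(n, sd = 400)

# `Gestazione * Sesso` expands to Gestazione + Sesso + Gestazione:Sesso.
fit_main <- lm(Peso ~ Gestazione + Sesso, data = sim)
fit_int  <- lm(Peso ~ Gestazione * Sesso, data = sim)

print(names(coef(fit_int)))

# Partial F-test: does the interaction term improve the fit?
print(anova(fit_main, fit_int))
```

The `Gestazione:SessoM` coefficient is the difference in the gestation slope between males and females, which is why a non-significant estimate suggests a common slope.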
## Start: AIC=29451.11
## Peso ~ Anni.madre + Gestazione + N.gravidanze + Fumatrici + Sesso +
## Ospedale + Tipo.parto
##
## Df Sum of Sq RSS AIC
## - Ospedale 2 164928 406161909 29448
## - Anni.madre 1 260506 406257487 29451
## - Tipo.parto 1 262678 406259659 29451
## <none> 405996981 29451
## - Fumatrici 1 1287312 407284293 29457
## - N.gravidanze 1 2822268 408819249 29466
## - Sesso 1 16850123 422847104 29549
## - Gestazione 1 138549431 544546412 30168
##
## Step: AIC=29448.11
## Peso ~ Anni.madre + Gestazione + N.gravidanze + Fumatrici + Sesso +
## Tipo.parto
##
## Df Sum of Sq RSS AIC
## - Anni.madre 1 269445 406431354 29448
## - Tipo.parto 1 271901 406433810 29448
## <none> 406161909 29448
## + Ospedale 2 164928 405996981 29451
## - Fumatrici 1 1305624 407467533 29454
## - N.gravidanze 1 2871020 409032929 29463
## - Sesso 1 16873019 423034928 29546
## - Gestazione 1 138817827 544979736 30166
##
## Step: AIC=29447.73
## Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso + Tipo.parto
##
## Df Sum of Sq RSS AIC
## - Tipo.parto 1 278357 406709711 29447
## <none> 406431354 29448
## + Anni.madre 1 269445 406161909 29448
## + Ospedale 2 173867 406257487 29451
## - Fumatrici 1 1327235 407758589 29454
## - N.gravidanze 1 4221000 410652354 29471
## - Sesso 1 16949646 423381000 29546
## - Gestazione 1 139220821 545652175 30167
##
## Step: AIC=29447.41
## Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso
##
## Df Sum of Sq RSS AIC
## <none> 406709711 29447
## + Tipo.parto 1 278357 406431354 29448
## + Anni.madre 1 275901 406433810 29448
## + Ospedale 2 183531 406526181 29450
## - Fumatrici 1 1301092 408010803 29453
## - N.gravidanze 1 4132307 410842018 29470
## - Sesso 1 16936415 423646126 29545
## - Gestazione 1 139157084 545866796 30166
##
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso,
## data = neonati_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1477.05 -274.13 -19.28 261.66 1864.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2537.500 198.084 -12.810 < 2e-16 ***
## Gestazione 146.614 5.070 28.918 < 2e-16 ***
## N.gravidanze 39.187 7.864 4.983 6.69e-07 ***
## Fumatrici1 -115.132 41.175 -2.796 0.00521 **
## SessoM 167.892 16.642 10.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared: 0.2991, Adjusted R-squared: 0.298
## F-statistic: 260.8 on 4 and 2444 DF, p-value: < 2.2e-16
Stepwise model: starting from the full model, variables are removed (or added back) one at a time whenever doing so lowers the AIC, which yields a more parsimonious model. Here Ospedale, Anni.madre, and Tipo.parto are dropped, leaving Gestazione, N.gravidanze, Fumatrici, and Sesso.
We use two feature selection procedures to pick a subset of the original features in the dataset. The candidate models are:
• initial_model: includes all the variables and the Gestazione:Sesso interaction.
• AIC_model: selected by minimizing AIC, which penalizes complexity less than BIC and therefore often retains more variables.
• BIC_model: selected by minimizing BIC, which penalizes complexity more strongly and favors simpler models; preferred for interpretability.
##
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso,
## data = neonati_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1477.05 -274.13 -19.28 261.66 1864.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2537.500 198.084 -12.810 < 2e-16 ***
## Gestazione 146.614 5.070 28.918 < 2e-16 ***
## N.gravidanze 39.187 7.864 4.983 6.69e-07 ***
## Fumatrici1 -115.132 41.175 -2.796 0.00521 **
## SessoM 167.892 16.642 10.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared: 0.2991, Adjusted R-squared: 0.298
## F-statistic: 260.8 on 4 and 2444 DF, p-value: < 2.2e-16
## df AIC
## modello_iniziale 11 36404.47
## modello_AIC 6 36399.37
## modello_BIC 6 36399.37
## df BIC
## modello_iniziale 11 36468.31
## modello_AIC 6 36434.19
## modello_BIC 6 36434.19
CONCLUSIONS:
In this project we evaluated the regression models by comparing their AIC and BIC values. In summary:
• We start from an initial model (with all variables and, optionally, interactions).
• stepAIC() is applied to minimize the AIC, yielding a model (modello_AIC in the output above) in which variables are removed or added according to the Akaike Information Criterion.
• stepAIC(…, k = log(n)) is applied to minimize the BIC, yielding a model (modello_BIC) that tends to be more parsimonious, because the Bayesian Information Criterion penalizes the number of parameters more heavily.
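The two selection runs described above can be sketched as follows. This is an illustration on simulated data, not the report's actual code: the column names mirror the report, the values and effect sizes are made up, and `Anni.madre` is deliberately given no true effect.

```r
library(MASS)  # provides stepAIC()

set.seed(42)
n <- 500

# Simulated data: column names mirror the report, values are made up.
sim <- data.frame(
  Gestazione = sample(35:42, n, replace = TRUE),
  Sesso      = factor(sample(c("F", "M"), n, replace = TRUE)),
  Anni.madre = sample(18:40, n, replace = TRUE)
)
sim$Peso <- -2500 + 145 * sim$Gestazione + 170 * (sim$Sesso == "M") +
  rnorm(n, sd = 400)  # Anni.madre has no true effect here

full <- lm(Peso ~ Gestazione + Sesso + Anni.madre, data = sim)

# Default penalty k = 2 gives AIC; k = log(n) gives BIC.
model_aic <- stepAIC(full, direction = "both", trace = FALSE)
model_bic <- stepAIC(full, direction = "both", trace = FALSE, k = log(n))

print(formula(model_aic))
print(formula(model_bic))
```

The only difference between the two calls is the penalty `k` per parameter, so the BIC run needs a larger improvement in fit to keep a variable.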
To compare the models and choose the best one, we considered:
• Fit: R^2 (or adjusted R^2) and the residual standard error.
• Information criteria: AIC and BIC, which penalize model complexity.
• Simplicity and interpretability: a model with fewer variables is preferable if the loss of accuracy is small.
Applied to our models:
• Accuracy: adjusted R^2 is approximately 0.298 in all models, so explanatory power barely changes between the full and the reduced version.
• Complexity penalty: both AIC and BIC favor the reduced version, which keeps 4 variables (Gestazione, N.gravidanze, Fumatrici, Sesso).
• Interpretability: the reduced model is more parsimonious (fewer coefficients) and keeps practically the same explanatory power.
In practice:
• The stepwise, AIC, and BIC procedures select the same model (AIC 36399.37, BIC 36434.19).
• The full and interaction models have a similar adjusted R^2 but include more non-significant variables and have slightly higher AIC (36403.07 and 36404.47).
| Model | AIC | BIC | R2 | Adj_R2 | RMSE |
|---|---|---|---|---|---|
| Full Model | 36403.07 | 36461.11 | 0.3004 | 0.2981 | 407.16 |
| Interaction Model | 36404.47 | 36468.31 | 0.3005 | 0.2980 | 407.11 |
| Stepwise | 36399.37 | 36434.19 | 0.2991 | 0.2980 | 407.52 |
| AIC Model | 36399.37 | 36434.19 | 0.2991 | 0.2980 | 407.52 |
| BIC Model | 36399.37 | 36434.19 | 0.2991 | 0.2980 | 407.52 |
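The AIC values in this table (about 36399) are on a different scale from those printed in the stepwise trace (about 29447). For a Gaussian linear model the two differ only by a constant that depends on the sample size, so model rankings are unaffected. The sketch below checks this with the numbers from the report and then on R's built-in `mtcars` data, which is used here purely as an illustration.

```r
n <- 2449  # observations in neonati_clean

# For a Gaussian linear model with p coefficients:
#   extractAIC() (used by step/stepAIC): n * log(RSS / n) + 2 * p
#   AIC():                               the same plus n * (1 + log(2 * pi)) + 2
offset <- n * (1 + log(2 * pi)) + 2

aic_step_trace <- 29447.41            # final "Step: AIC=" value in the trace
aic_full_scale <- aic_step_trace + offset
print(round(aic_full_scale, 2))

# The same identity checked on a built-in dataset.
fit   <- lm(mpg ~ wt + hp, data = mtcars)
n_fit <- nrow(mtcars)
identity_holds <- isTRUE(all.equal(
  AIC(fit),
  extractAIC(fit)[2] + n_fit * (1 + log(2 * pi)) + 2
))
print(identity_holds)
```

Adding the constant to the trace value recovers the 36399.37 reported for the stepwise model.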
##
## Call:
## lm(formula = Peso ~ Gestazione + N.gravidanze + Fumatrici + Sesso,
## data = neonati_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1477.05 -274.13 -19.28 261.66 1864.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2537.500 198.084 -12.810 < 2e-16 ***
## Gestazione 146.614 5.070 28.918 < 2e-16 ***
## N.gravidanze 39.187 7.864 4.983 6.69e-07 ***
## Fumatrici1 -115.132 41.175 -2.796 0.00521 **
## SessoM 167.892 16.642 10.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 407.9 on 2444 degrees of freedom
## Multiple R-squared: 0.2991, Adjusted R-squared: 0.298
## F-statistic: 260.8 on 4 and 2444 DF, p-value: < 2.2e-16
## [1] 407.519
## StudRes Hat CookD
## 310 -0.3100152 0.019541517 0.0003832531
## 1385 -0.2330505 0.016488238 0.0001821767
## 1553 4.3665777 0.006150757 0.0234272751
## 1920 4.6147906 0.010720106 0.0457743482
| Scenario | Predicted_Weight | Lower_PI | Upper_PI |
|---|---|---|---|
| Mother on 3rd pregnancy, 39 weeks gestation, non-smoker, female | 3298 | 2497.1 | 4098.9 |
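The point prediction in the table can be checked by hand from the coefficients of the final model. The sketch below assumes that "3rd pregnancy" was coded as `N.gravidanze = 3` and that a non-smoking mother and a female newborn correspond to the reference levels (both dummy variables equal to 0).

```r
# Coefficients copied from the summary of the final (stepwise) model.
coefs <- c(
  "(Intercept)" = -2537.500,
  Gestazione    =   146.614,
  N.gravidanze  =    39.187,
  Fumatrici1    =  -115.132,
  SessoM        =   167.892
)

# Scenario from the table: 39 weeks, N.gravidanze = 3 (assumption),
# non-smoker (Fumatrici1 = 0), female (SessoM = 0).
x_new <- c(
  "(Intercept)" = 1,
  Gestazione    = 39,
  N.gravidanze  = 3,
  Fumatrici1    = 0,
  SessoM        = 0
)

# Linear predictor: sum of coefficient times covariate value.
predicted_weight <- sum(coefs * x_new[names(coefs)])
print(round(predicted_weight))
```

The prediction interval cannot be reproduced from the coefficients alone, because it also depends on the residual variance and the design matrix; with the fitted model object, `predict()` with `interval = "prediction"` returns it directly.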