Neonatal Data Analysis

Data Loading and Initial Exploration

In the initial phase of the analysis, the dataset containing information about neonates was loaded and explored to understand its structure, size, and the characteristics of its variables.

Dataset Overview

The dataset comprises 2500 observations and 10 variables, each providing important information about neonatal characteristics, maternal factors, and delivery details. The variables are a mix of numerical and categorical data, offering a comprehensive view for analysis.

dati <- read.csv("/Users/matteocalzametre/Desktop/neonati.csv")
dim(dati)
## [1] 2500   10
names(dati)
##  [1] "Anni.madre"   "N.gravidanze" "Fumatrici"    "Gestazione"   "Peso"        
##  [6] "Lunghezza"    "Cranio"       "Tipo.parto"   "Ospedale"     "Sesso"
head(dati)
##   Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1         26            0         0         42 3380       490    325        Nat
## 2         21            2         0         39 3150       490    345        Nat
## 3         34            3         0         38 3640       500    375        Nat
## 4         28            1         0         41 3690       515    365        Nat
## 5         20            0         0         38 3700       480    335        Nat
## 6         32            0         0         40 3200       495    340        Nat
##   Ospedale Sesso
## 1     osp3     M
## 2     osp1     F
## 3     osp2     M
## 4     osp2     M
## 5     osp3     F
## 6     osp2     F
str(dati)
## 'data.frame':    2500 obs. of  10 variables:
##  $ Anni.madre  : int  26 21 34 28 20 32 26 25 22 23 ...
##  $ N.gravidanze: int  0 2 3 1 0 0 1 0 1 0 ...
##  $ Fumatrici   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gestazione  : int  42 39 38 41 38 40 39 40 40 41 ...
##  $ Peso        : int  3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
##  $ Lunghezza   : int  490 490 500 515 480 495 480 510 500 510 ...
##  $ Cranio      : int  325 345 375 365 335 340 345 349 335 362 ...
##  $ Tipo.parto  : chr  "Nat" "Nat" "Nat" "Nat" ...
##  $ Ospedale    : chr  "osp3" "osp1" "osp2" "osp2" ...
##  $ Sesso       : chr  "M" "F" "M" "M" ...
sum(is.na(dati))
## [1] 0

Key Variables and Their Types

Below is a description of the key variables, along with their types:

Observations

The dataset provides a detailed view of maternal factors, neonatal measurements, and delivery information, which are crucial for understanding birth outcomes. The variables offer diverse perspectives, with numerical variables supporting quantitative analysis and categorical variables enriching the dataset with qualitative information.

The target variable is neonate’s weight (Peso), which will be the focus of our predictive modeling efforts later in the analysis.

Analysis of Central Tendency, Variability, and Shape for Numerical Variables

In this section, we conducted a detailed analysis of key summary statistics for all numerical variables in the dataset. This includes measures of central tendency (mean and median), variability (standard deviation, interquartile range, and percentiles), and shape (skewness and kurtosis). These metrics provide critical insights into the characteristics of the data distribution.

library(moments)
library(knitr)

summary_stats <- sapply(dati[, c("Anni.madre","N.gravidanze","Gestazione","Peso", "Lunghezza", "Cranio")], function(x){
  c(Mean = round(mean(x),2),
    Median = round(median(x),2),
    SD = round(sd(x),2),
    IQR = round(IQR(x),2),
    percentil = round(quantile(x),2),
    Skewness = round(skewness(x),2),
    Kurtosis = round(kurtosis(x)-3))
})

kable(as.data.frame(t(summary_stats)), caption = "Summary Stats")
Summary Stats
Mean Median SD IQR percentil.0% percentil.25% percentil.50% percentil.75% percentil.100% Skewness Kurtosis
Anni.madre 28.16 28 5.27 7 0 25 28 32 46 0.04 0
N.gravidanze 0.98 1 1.28 1 0 0 1 1 12 2.51 11
Gestazione 38.98 39 1.87 2 25 38 39 40 43 -2.07 8
Peso 3284.08 3300 525.04 630 830 2990 3300 3620 4930 -0.65 2
Lunghezza 494.69 500 26.32 30 310 480 500 510 565 -1.51 6
Cranio 340.03 340 16.43 20 235 330 340 350 390 -0.79 3

Key Observations:

These findings help provide an initial understanding of the numerical variables’ distributions. Further graphical exploration will confirm and visualize these observations, especially for key variables like neonate weight and length.

Frequency Analysis of Categorical Variables

Overview

The following section provides a frequency distribution analysis of categorical variables in the dataset. For each variable, we present the absolute and relative frequencies, and in some cases, cumulative frequencies. These distributions help us understand the structure and proportion of the data for qualitative attributes.

  1. Delivery Type (Tipo.parto)

The variable “Tipo.parto” categorizes deliveries as either natural (Nat) or cesarean (Ces). Below is the frequency distribution:

Observation: We note a clear prevalence of natural deliveries over cesarean deliveries.

N <- dim(dati)[1]

freq_ass <- table(dati$Tipo.parto)
freq_rel <- round(table(dati$Tipo.parto)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
##     freq_ass freq_rel
## Ces      728     0.29
## Nat     1772     0.71
  1. Smoking Status (Fumatrici)

The variable “Fumatrici” captures whether the mother smoked during pregnancy. The data was recoded to “yes” for smokers and “no” for non-smokers. The results are as follows:

Observation: The dataset shows an overwhelming prevalence of non-smoking mothers (96%), with only 4% smokers.

dati$Fumatrici <- ifelse(dati$Fumatrici == 1, "yes", "no")
freq_ass <- table(dati$Fumatrici)
freq_rel <- round(table(dati$Fumatrici)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
##     freq_ass freq_rel
## no      2396     0.96
## yes      104     0.04
  1. Number of Pregnancies (N.gravidanze)

This variable captures the number of pregnancies. An initial analysis showed a wide range of values, which were later grouped into intervals:

freq_ass <- table(dati$N.gravidanze)
freq_rel <- round(table(dati$N.gravidanze)/N,2)
freq_ass_cum <- cumsum(freq_ass)
freq_rel_cum <- cumsum(freq_rel)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel, freq_ass_cum, freq_rel_cum))
distr_freq
##    freq_ass freq_rel freq_ass_cum freq_rel_cum
## 0      1096     0.44         1096         0.44
## 1       818     0.33         1914         0.77
## 2       340     0.14         2254         0.91
## 3       150     0.06         2404         0.97
## 4        48     0.02         2452         0.99
## 5        21     0.01         2473         1.00
## 6        11     0.00         2484         1.00
## 7         1     0.00         2485         1.00
## 8         8     0.00         2493         1.00
## 9         2     0.00         2495         1.00
## 10        3     0.00         2498         1.00
## 11        1     0.00         2499         1.00
## 12        1     0.00         2500         1.00

Observation: To facilitate interpretation, the variable was grouped into 5 classes based on frequency distribution, highlighting that most mothers had 0 or 1 pregnancy.

dati$N.gravidanze <- cut(dati$N.gravidanze,
                         breaks = c(-Inf,0,1,2,3,Inf),
                         labels = c("0","1","2","3",">3"))

freq_ass <- table(dati$N.gravidanze)
freq_rel <- round(table(dati$N.gravidanze)/N,2)
freq_ass_cum <- cumsum(freq_ass)
freq_rel_cum <- cumsum(freq_rel)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel, freq_ass_cum, freq_rel_cum))
distr_freq
##    freq_ass freq_rel freq_ass_cum freq_rel_cum
## 0      1096     0.44         1096         0.44
## 1       818     0.33         1914         0.77
## 2       340     0.14         2254         0.91
## 3       150     0.06         2404         0.97
## >3       96     0.04         2500         1.01
  1. Hospital of Delivery (Ospedale)

The dataset includes deliveries from three hospitals (osp1, osp2, osp3). The distribution of deliveries is as follows:

Observation: The deliveries are evenly distributed across the three hospitals, showing no significant imbalance.

freq_ass <- table(dati$Ospedale)
freq_rel <- round(table(dati$Ospedale)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
##      freq_ass freq_rel
## osp1      816     0.33
## osp2      849     0.34
## osp3      835     0.33
  1. Gender of Neonates (Sesso)

The gender of neonates is categorized as male (M) or female (F). The distribution is nearly equal:

Observation: The gender distribution is perfectly balanced, with an equal proportion of males and females.

freq_ass <- table(dati$Sesso)
freq_rel <- round(table(dati$Sesso)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
##   freq_ass freq_rel
## F     1256      0.5
## M     1244      0.5

Analysis of the Weight Variable (target variable)

The summary statistics for the “Weight” variable reveal a median value of 3300 grams and a mean of 3284.08 grams, with a standard deviation of 525.04 grams. The interquartile range (IQR) is 630 grams, indicating moderate variability in the central 50% of the data. The skewness of -0.65 suggests a slight negative asymmetry, indicating a higher frequency of heavier weights and a tail skewed toward lighter weights. The kurtosis of 2.03 shows a leptokurtic distribution, with more pronounced peaks and tails compared to a normal distribution.

These observations highlight that while the data is relatively symmetric around the mean and median, there are heavier weights with fewer extremely low weights, as evidenced by the negative skewness. The leptokurtic nature also suggests some concentration around the mean, with a few more extreme values influencing the distribution.

summary(dati$Peso)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     830    2990    3300    3284    3620    4930
round(sd(dati$Peso),2)
## [1] 525.04
round(skewness(dati$Peso),2)
## [1] -0.65
round(kurtosis(dati$Peso)-3,2)
## [1] 2.03

Statistical Test for Normality of Distribution

The Shapiro-Wilk Normality Test was applied to assess whether the distribution of birth weight follows a normal distribution. The null hypothesis (H0) of the test assumes that the data are normally distributed. The resulting p-value was significantly lower than the significance level of 0.05, leading us to reject the null hypothesis and conclude that the variable does not follow a normal distribution.

shapiro.test(dati$Peso)
## 
##  Shapiro-Wilk normality test
## 
## data:  dati$Peso
## W = 0.97066, p-value < 2.2e-16

Graphical Analysis

To complement the statistical indices, we performed a graphical analysis of the birth weight variable using three visualizations:

library(ggplot2)
library(gridExtra)

p1 <- ggplot(dati, aes(x = Peso)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curve") +
  xlab("Weigth")+
  theme_minimal()


p2 <- ggplot(dati, aes(sample = Peso)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red") +
  ggtitle("QQ Plot of Neonatal Weight") +
  theme_minimal()


p3 <- ggplot(dati, aes(x = "", y = Peso)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
  ggtitle("Boxplot of Neonatal Weight") +
  ylab("Weight (grams)") +
  theme_minimal()

grid.arrange(p1,p2,p3, ncol=3)

Conclusions

The combination of statistical indices, normality testing, and graphical analysis provides a comprehensive view of the birth weight variable. The results indicate that the distribution is slightly negatively skewed with a high kurtosis, and it significantly deviates from normality. These findings will be critical in further modeling and analysis steps, as they guide the choice of appropriate statistical methods to handle the observed deviations.

Analysis of the Length Variable

The summary statistics for the “Length” variable indicate a median value of 500 mm and a mean of 494.7 mm, with a standard deviation of 26.32 mm. The interquartile range (IQR) is 30 mm, reflecting the spread of the middle 50% of the data. The skewness of -1.51 reveals a negative asymmetry, indicating a higher frequency of larger lengths and a tail skewed toward smaller lengths. The kurtosis of 6.49 highlights a leptokurtic distribution, suggesting more pronounced peaks and tails compared to a normal distribution.

summary(dati$Lunghezza)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   310.0   480.0   500.0   494.7   510.0   565.0
round(sd(dati$Lunghezza),2)
## [1] 26.32
round(skewness(dati$Lunghezza),2)
## [1] -1.51
round(kurtosis(dati$Lunghezza)-3,2)
## [1] 6.49
library(ggplot2)
library(gridExtra)

p1 <- ggplot(dati, aes(x = Lunghezza)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curve") +
  xlab("Length")+
  theme_minimal()


p2 <- ggplot(dati, aes(sample = Lunghezza)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red") +
  ggtitle("QQ Plot of Neonatal Length") +
  theme_minimal()


p3 <- ggplot(dati, aes(x = "", y = Lunghezza)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
  ggtitle("Boxplot of Neonatal Length") +
  ylab("Length (mm)") +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 3)

The graphical analysis supports these findings:

Overall, the “Length” variable demonstrates a negatively skewed and leptokurtic distribution, with most values centered around the median but with some extreme lower values influencing the distribution.

Analysis of Maternal Age (Anni Madre)

Addressing Anomalies in Maternal Age

During the preliminary analysis, we identified anomalous values in the Anni.madre variable, specifically two records indicating ages of 0 and 1 year. These values are implausible for maternal age and likely resulted from data entry errors. Rather than removing these records and potentially losing valuable information about other variables, we opted to address the anomalies through data imputation.

The median value of the variable Anni.madre, which is robust to the influence of outliers, was used to replace these implausible values. This choice ensures that the overall distribution and central tendency of the data remain representative while addressing inaccuracies.

After imputing the anomalous values, we calculated key descriptive statistics for Anni.madre:

dati$Anni.madre[dati$Anni.madre == 0 | dati$Anni.madre == 1] <- median(dati$Anni.madre)

summary(dati$Anni.madre)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   25.00   28.00   28.19   32.00   46.00
round(sd(dati$Anni.madre),2)
## [1] 5.22
round(skewness(dati$Anni.madre),2)
## [1] 0.15
round(kurtosis(dati$Anni.madre)-3,2)
## [1] -0.1
p1 <- ggplot(dati, aes(x = Anni.madre)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curve") +
  xlab("Maternal Age (year)")+
  theme_minimal()

p2 <- ggplot(dati, aes(x = "", y = Anni.madre)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
  ggtitle("Boxplot of Maternal Age") +
  ylab("Maternal Age (year)") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

Graphical Analysis

We further examined the distribution of maternal age using:

  1. Density Histogram: Showcased a unimodal distribution, aligning closely with the descriptive statistics. The histogram revealed a concentration of values around the median age.
  2. Boxplot: Confirmed the presence of some high outliers but no significant anomalies remaining after the imputation.
Key Observations:

Analysis of the Variable: Gestation

To explore the distribution of the gestational weeks variable, we first calculated summary statistics including measures of central tendency, variability, and shape. The results are summarized below:

summary(dati$Gestazione)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   38.00   39.00   38.98   40.00   43.00
round(sd(dati$Gestazione),2)
## [1] 1.87
round(skewness(dati$Gestazione),2)
## [1] -2.07
round(kurtosis(dati$Gestazione)-3,2)
## [1] 8.26
ggplot(dati, aes(x = Gestazione)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, color = "black", fill = "lightblue") +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curve") +
  xlab("Gestation (weeks)")+
  theme_minimal()

The histogram with a density curve reveals a negatively skewed distribution, consistent with the calculated skewness. The data shows a concentration of gestational weeks near the 40-week mark, with a notable peak. The presence of sharp peaks suggests some common gestation durations, likely reflective of standard medical practices or recording conventions.

Overall, the analysis highlights the distribution’s asymmetry and peakedness, which could warrant further investigation to understand medical or procedural influences on the data.

Analysis of the Variable: Cranial Circumference

The analysis of the cranial circumference variable highlights the following statistical properties:

summary(dati$Cranio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     235     330     340     340     350     390
round(sd(dati$Cranio),2)
## [1] 16.43
round(skewness(dati$Cranio),2)
## [1] -0.79
round(kurtosis(dati$Cranio)-3,2)
## [1] 2.95

Graphical Analysis

The following plots provide a visual analysis of the cranial circumference distribution:

p1 <- ggplot(dati, aes(x = Cranio)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle("Histogram with Density Curve") +
  xlab("Cranial Circumference (mm)")+
  theme_minimal()

p2 <- ggplot(dati, aes(sample = Cranio)) +
  stat_qq(color = "blue") +
  stat_qq_line(color = "red") +
  ggtitle("QQ Plot of Cranial Circumference") +
  theme_minimal()

p3 <- ggplot(dati, aes(x = "", y = Cranio)) +
  geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
  ggtitle("Boxplot of Cranial Circumference") +
  ylab("Cranial Circumference (mm)") +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 3)

Overall, the analysis of cranial circumference reveals a moderately variable distribution with a slight left skew and a peaked profile.

Graphical Analysis of Qualitative Variables

The graphical analysis of qualitative variables is performed to understand the distribution of categories within each variable. Bar plots were used to visualize the frequencies of each category, providing an intuitive representation of the data.

p1 <- ggplot(dati, aes(x = N.gravidanze)) +
  geom_bar(fill = "lightblue", color = "black") +
  ggtitle("Distribution of Number of Pregnancies") +
  xlab("Number of Pregnancies") +
  ylab("Frequency") +
  theme_minimal()

p2 <- ggplot(dati, aes(x = Fumatrici)) +
  geom_bar(fill = "lightgreen", color = "black") +
  ggtitle("Distribution of Smokers") +
  xlab("Smokers (yes/no)") +
  ylab("Frequency") +
  theme_minimal()

p3 <- ggplot(dati, aes(x = Tipo.parto)) +
  geom_bar(fill = "lightcoral", color = "black") +
  ggtitle("Distribution of Delivery Type") +
  xlab("Delivery Type (Ces/Nat)") +
  ylab("Frequency") +
  theme_minimal()

p4 <- ggplot(dati, aes(x = Ospedale)) +
  geom_bar(fill = "lightpink", color = "black") +
  ggtitle("Distribution of Hospital") +
  xlab("Hospital") +
  ylab("Frequency") +
  theme_minimal()

p5 <- ggplot(dati, aes(x = Sesso)) +
  geom_bar(fill = "lightcyan", color = "black") +
  ggtitle("Distribution of Gender") +
  xlab("Gender (F/M)") +
  ylab("Frequency") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, p5, ncol = 2)

Discussion of Results

From the graphical analysis, we observe the following key findings:

These findings provide a comprehensive overview of the dataset’s categorical variables and validate the consistency of the data, serving as a strong foundation for further analyses.

Analysis of Correlation Between Numerical Variables

In this section, we analyze the correlation between the numerical variables in the dataset. A correlation matrix was computed to identify potential linear relationships between variables. Visualizations, such as the correlation plot, were used to enhance the interpretation of these relationships. Particular attention was given to correlations with the target variable, Peso (weight), to explore potential predictors or influential variables.

library(ggcorrplot)

numerical_vars <- dati[, c("Peso", "Lunghezza", "Cranio", "Anni.madre", "Gestazione")]
cor_matrix <- as.data.frame(round(cor(numerical_vars),2))
cor_matrix
##             Peso Lunghezza Cranio Anni.madre Gestazione
## Peso        1.00      0.80   0.70      -0.02       0.59
## Lunghezza   0.80      1.00   0.60      -0.06       0.62
## Cranio      0.70      0.60   1.00       0.02       0.46
## Anni.madre -0.02     -0.06   0.02       1.00      -0.13
## Gestazione  0.59      0.62   0.46      -0.13       1.00

Key Observations

Overall, the variables Lunghezza, Cranio, and Gestazione emerge as key factors potentially influencing the target variable Peso. These insights will be crucial for further modeling and prediction tasks.

ggcorrplot(cor_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           colors = c("blue", "white", "red"), 
           title = "Correlation Matrix", 
           ggtheme = theme_minimal())

Analysis of Correlation Between Numerical Variables and Target Variable

Weight - Length

The computed correlation coefficient between Weight and Length was 0.796, indicating a strong positive linear relationship.

cor(dati$Peso, dati$Lunghezza)
## [1] 0.7960368

Visualization of the Relationship

A scatter plot was generated to visually explore the correlation between Peso and Lunghezza. The points display a clear upward trend, further confirming the positive association between these variables.

ggplot(dati, aes(x = Lunghezza, y = Peso)) +
  geom_point(color = "black") +
  ggtitle("Scatter Plot between Weight and Length") +
  xlab("Length (mm)") +
  ylab("Weight (grams)") +
  theme_minimal()

Significance Testing of the Correlation

A Pearson correlation test was conducted to assess the statistical significance of the observed correlation. The results (p-value < 2.2e-16) strongly reject the null hypothesis, suggesting that the correlation is highly significant.

cor_test <- cor.test(dati$Lunghezza, dati$Peso)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Lunghezza and dati$Peso
## t = 65.735, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7812132 0.8099631
## sample estimates:
##       cor 
## 0.7960368

Weight-Cranial Circumference

The computed Pearson’s correlation coefficient between Peso and Cranio is 0.704, indicating a strong positive linear relationship. This suggests that as the cranial diameter increases, the birth weight also tends to increase.

cor(dati$Peso, dati$Cranio)
## [1] 0.7048015

Visualization of the Relationship

The scatter plot between Peso and Cranio shows a clear upward trend. This visualization supports the observed positive association, further reinforcing the linear relationship suggested by the correlation coefficient.

ggplot(dati, aes(x = Cranio, y = Peso))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Wight and Cranial Circumference")+
  xlab("Cranial Circumference (mm)")+
  ylab("Weight (grams)")+
  theme_minimal()

Significance Testing of the Correlation

A Pearson correlation test was performed to assess the statistical significance of the correlation between Weight and Cranial Circumference. The test results (p-value < 2.2e-16) indicate a highly significant correlation, confirming that the observed relationship is unlikely to be due to chance.

cor_test <- cor.test(dati$Peso, dati$Cranio)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Peso and dati$Cranio
## t = 49.656, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6845119 0.7240000
## sample estimates:
##       cor 
## 0.7048015

Weight-Gestation

The Pearson’s correlation coefficient between Weight and Gestation is 0.592, representing a moderately strong positive linear relationship. This implies that longer gestational periods are generally associated with higher birth weights.

cor(dati$Peso, dati$Gestazione)
## [1] 0.5917687

Visualization of the Relationship

A scatter plot of Peso versus Gestazione illustrates the positive relationship, with an upward trend indicating that higher gestational weeks contribute to greater birth weights.

ggplot(dati, aes(x = Gestazione, y = Peso))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Weight and Gestation")+
  xlab("Gestation (weeks)")+
  ylab("Weight (grams)")+
  theme_minimal()

Significance Testing of the Correlation

A Pearson correlation test confirmed the statistical significance of the observed relationship between Weight and Gestation. The highly significant p-value (< 2.2e-16) rejects the null hypothesis, affirming the positive correlation between these variables.

cor_test <- cor.test(dati$Peso, dati$Gestazione)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Peso and dati$Gestazione
## t = 36.691, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5656894 0.6166655
## sample estimates:
##       cor 
## 0.5917687

Lenght-Cranial Circumference

The computed Pearson correlation coefficient between Length and Cranial Circumference is 0.603, suggesting a moderate positive relationship between these two variables.

cor(dati$Lunghezza, dati$Cranio)
## [1] 0.603341

Visualization of the Relationship

The scatter plot for Length and Cranial Circumference demonstrates a clear upward trend, confirming the positive association between these variables.

ggplot(dati, aes(x = Lunghezza, y = Cranio))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Weight and Cranial Circumference")+
  xlab("Lenght (mm)")+
  ylab("Cranial Circumference (mm)")+
  theme_minimal()

Significance Testing of the Correlation

A Pearson correlation test confirms the statistical significance of the correlation between Lenght and Cranial Circumference. The p-value is < 2.2e-16, strongly rejecting the null hypothesis, indicating a meaningful association.

cor_test <- cor.test(dati$Lunghezza, dati$Cranio)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Lunghezza and dati$Cranio
## t = 37.813, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5778048 0.6276970
## sample estimates:
##      cor 
## 0.603341

Length-Gestation

The correlation coefficient between Length and Gestation is 0.619, which represents a moderate positive linear relationship.

cor(dati$Lunghezza, dati$Gestazione)
## [1] 0.6189204

Visualization of the Relationship

The scatter plot for Length and Gestation illustrates a positive trend, supporting the observed correlation between the two variables.

ggplot(dati, aes(x = Lunghezza, y = Gestazione))+
  geom_point(color = "black")+
  ggtitle("Scatter between Lenght and Gestation")+
  xlab("Length (mm)")+
  ylab("Gestation (weeks)")+
  theme_minimal()

Significance Testing of the Correlation

The significance test for the correlation between Length and Gestation yields a p-value of < 2.2e-16, providing strong evidence for the significance of the relationship.

cor_test <- cor.test(dati$Lunghezza, dati$Gestazione)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Lunghezza and dati$Gestazione
## t = 39.383, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5941334 0.6425332
## sample estimates:
##       cor 
## 0.6189204

Cranial Circumference-Gestation

The correlation coefficient between Cranial Circumference and Gestation was calculated to be 0.4608, indicating a moderate positive correlation.

cor(dati$Cranio, dati$Gestazione)
## [1] 0.4608289

Visualization of the Relationship: The scatter plot revealed a generally upward trend, supporting the moderate positive relationship between the two variables.

ggplot(dati, aes(x= Cranio, y = Gestazione))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Cranial Circumference and Gestation")+
  xlab("Cranial Circumference (mm)")+
  ylab("Gestation (weeks)")+
  theme_minimal()

Significance Testing: The Pearson correlation test results showed a p-value < 2.2e-16, suggesting the correlation is statistically significant.

cor_test <- cor.test(dati$Cranio, dati$Gestazione)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Cranio and dati$Gestazione
## t = 25.952, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4293834 0.4911585
## sample estimates:
##       cor 
## 0.4608289

Maternal Age-Gestation

Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Gestation was -0.1349, indicating a weak negative correlation.

cor(dati$Anni.madre, dati$Gestazione)
## [1] -0.1349262

Visualization of the Relationship: The scatter plot did not show a pronounced trend, consistent with the weak negative correlation observed.

ggplot(dati, aes(x = Gestazione, y = Anni.madre))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Maternal Age and Gestation")+
  xlab("Gestation (weeks)")+
  ylab("Maternal Age")+
  theme_minimal()

Significance Testing: The Pearson correlation test yielded a p-value of 1.253e-11, which indicates the weak correlation is statistically significant.

cor_test <- cor.test(dati$Anni.madre, dati$Gestazione)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Anni.madre and dati$Gestazione
## t = -6.8058, df = 2498, p-value = 1.253e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17321275 -0.09623254
## sample estimates:
##        cor 
## -0.1349262

Maternal Age vs Cranial Circumference

Calculation of Pearson’s Correlation Coefficient: The correlation coefficient between Maternal Age and Cranial Circumference was 0.0162. This suggests no meaningful correlation between the two variables.

cor(dati$Anni.madre, dati$Cranio)
## [1] 0.01620269

Visualization of the Relationship: The scatter plot showed a random distribution of points, reflecting the negligible correlation.

ggplot(dati, aes(x = Cranio, y = Anni.madre))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Maternal Age and Cranial Circumference")+
  xlab("Cranial Circumference (mm)")+
  ylab("Maternal Age")+
  theme_minimal()

Significance Testing: The correlation test had a p-value of 0.4181, indicating the lack

cor_test <- cor.test(dati$Anni.madre, dati$Cranio)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Anni.madre and dati$Cranio
## t = 0.80992, df = 2498, p-value = 0.4181
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.02301464  0.05537024
## sample estimates:
##        cor 
## 0.01620269

Maternal Age-Weight

Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Weight was -0.0238, pointing to a very weak negative correlation.

cor(dati$Anni.madre, dati$Peso)
## [1] -0.02377346

Visualization of the Relationship: The scatter plot displayed no clear trend, aligning with the weak correlation.

ggplot(dati, aes(x = Peso, y = Anni.madre))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Maternal Age and Weight")+
  xlab("Weight (grams)")+
  ylab("Maternal Age")+
  theme_minimal()

Significance Testing: The p-value of 0.2347 indicates the correlation is not statistically significant.

cor_test <- cor.test(dati$Anni.madre, dati$Peso)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Anni.madre and dati$Peso
## t = -1.1885, df = 2498, p-value = 0.2347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06291754  0.01544365
## sample estimates:
##         cor 
## -0.02377346

Maternal Age-Lenght

Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Length was -0.0649, indicating a weak negative correlation.

cor(dati$Anni.madre, dati$Lunghezza)
## [1] -0.06495563

Visualization of the Relationship: The scatter plot showed a slight downward trend, consistent with the weak negative correlation.

ggplot(dati, aes(x = Lunghezza, y = Anni.madre))+
  geom_point(color = "black")+
  ggtitle("Scatter Plot between Maternal Age and Length")+
  xlab("Length (mm)")+
  ylab("Maternal Age")+
  theme_minimal()

Significance Testing: The Pearson correlation test produced a p-value of 0.001156, indicating that the weak correlation is statistically significant.

cor_test <- cor.test(dati$Anni.madre, dati$Lunghezza)
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  dati$Anni.madre and dati$Lunghezza
## t = -3.2534, df = 2498, p-value = 0.001156
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10389379 -0.02581865
## sample estimates:
##         cor 
## -0.06495563

Correlation Analysis between Target Variable and Qualitative Variables/h3>

Weight-Hospital

The violin plot presented above illustrates the distribution of neonate weights across the three hospitals: osp1, osp2, and osp3. The plot combines density estimation with boxplots, allowing for a clear visualization of the central tendency and spread of weights within each hospital. Each violin shape indicates the distribution of weights, while the embedded boxplot highlights the median and interquartile range. Observing the plot, it seems that the distributions of weights are relatively similar across the hospitals, with minor differences in the spread and density.

ggplot(dati, aes(x = Ospedale, y = Peso, fill = Ospedale)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", outlier.shape = NA) +
  labs(title = "Distribution of Weight by Hospital",
       x = "Hospital", y = "Weight (grams)") +
  theme_minimal() +
  theme(legend.position = "none")

Analysis of Variance (ANOVA) Results

To statistically verify whether there are significant differences in the mean weights of neonates across the three hospitals, an ANOVA test was conducted. The summary of the ANOVA results is shown above. The F-statistic value is 1.699, and the associated p-value is 0.183. Since the p-value is greater than the conventional significance level (e.g., 0.05), we fail to reject the null hypothesis. This indicates that there is no statistically significant difference in the mean weights of neonates among the three hospitals.

anova_result <- aov(Peso ~ Ospedale, data = dati)
summary(anova_result)
##               Df    Sum Sq Mean Sq F value Pr(>F)
## Ospedale       2    936237  468118   1.699  0.183
## Residuals   2497 687952305  275512

Weight-Number of Pregnancies

The violin plot visualizes the distribution of neonatal weight (in grams) across different categories of the number of pregnancies.The plot shows a consistent central tendency across categories, with no notable outliers or significant differences in shape or spread of the distributions. This suggests that the number of pregnancies does not strongly influence neonatal weight in this dataset.

ggplot(dati, aes(x=N.gravidanze,y=Peso, fill = N.gravidanze))+
  geom_violin(trim = FALSE, alpha=0.6) +
  geom_boxplot(width = 0.1, color = "black", outlier.shape=NA)+
  labs(title = "Distribution of Weight by Number of Pregnancies",
       x = "Number of Pregnancies", y = "Weight (grams)")+
  theme_minimal()+
  theme(legend.position = "none")

The ANOVA test was conducted to statistically evaluate if the mean neonatal weights differ significantly across the categories of the number of pregnancies. The results indicate a p-value of 0.755, which is well above the standard significance level of 0.05. This suggests that we fail to reject the null hypothesis, and there is no statistical evidence of significant differences in the mean weights between the groups defined by the number of pregnancies.

anova_result <- aov(Peso ~ N.gravidanze, data = dati)
summary(anova_result)
##                Df    Sum Sq Mean Sq F value Pr(>F)
## N.gravidanze    4    522163  130541   0.473  0.755
## Residuals    2495 688366379  275898

Weight-Smokers

The violin plot provides a detailed visualization of the distribution of neonatal weights for the two groups: mothers who smoke and mothers who do not smoke. The plot reveals slight differences in weight distribution between the groups, with non-smoking mothers generally having neonates with higher weights. However, the overlapping distributions suggest the differences may not be statistically significant.

ggplot(dati, aes(x=Fumatrici, y=Peso, fill=Fumatrici))+
  geom_violin(trim = FALSE, alpha=0.6)+
  geom_boxplot(width = 0.1, color = "black", outlier.shape=NA)+
  labs(title = "Distribution of Weight by Smokers",
       x = "Smokers (yes/no)", y = "Weight (grams)")+
  theme_minimal()+
  theme(legend.position = "none")

The Welch two-sample t-test was performed to assess whether the mean neonatal weight significantly differs between the two groups (smokers and non-smokers). The test yielded a t-value of 1.034 with a p-value of 0.3033. This result indicates that there is no statistically significant difference in neonatal weight between the two groups at a 5% significance level.

t.test(Peso~Fumatrici, data = dati)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -45.61354 145.22674
## sample estimates:
##  mean in group no mean in group yes 
##          3286.153          3236.346

Weight-Delivery Type

The violin plot above displays the distribution of neonatal weight according to the type of delivery (Cesarean or Natural). The distributions of weight for both groups are visually similar, with overlapping ranges and comparable medians.

ggplot(dati, aes(x=Tipo.parto, y=Peso, fill = Tipo.parto))+
  geom_violin(trim = FALSE, alpha = 0.6)+
  geom_boxplot(width=0.1, color = "black", outlier.shape=NA)+
  labs(title = "Distribution of Weight by Deliverry Type",
       x = "Delivery Type", y = "Weight (grams)")+
  theme_minimal()+
  theme(legend.position = "none")

With a p-value of 0.8968, the null hypothesis cannot be rejected. This suggests that there is no statistically significant difference in the mean neonatal weight between Cesarean and Natural delivery groups.

t.test(Peso~Tipo.parto, data = dati)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Tipo.parto
## t = -0.12968, df = 1493, p-value = 0.8968
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
##  -46.27992  40.54037
## sample estimates:
## mean in group Ces mean in group Nat 
##          3282.047          3284.916

Weight-Gender

The violin plot shows the weight distributions for males and females. Male neonates exhibit a higher median weight and a wider spread compared to females.

ggplot(dati, aes(x=Sesso, y=Peso, fill = Sesso))+
  geom_violin(trim = FALSE, alpha = 0.6)+
  geom_boxplot(width = 0.1, color = "black", outlier.shape = NA)+
  labs(title="Distribution of Weight by Gender",
       x = "Gender (M/F)", y = "Weight (grams)")+
  theme_minimal()+
  theme(legend.position="none")

The Welch’s t-test indicates a significant difference in mean weights (p < 0.001). Males have a higher mean weight (3408.215g) compared to females (3161.132g), confirmed by a confidence interval of (-287.1051, -207.0615).

t.test(Peso~Sesso, data = dati)
## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M 
##        3161.132        3408.215

Building and Evaluating the Multiple Linear Regression Model

Exploratory Correlation and Scatter Plot Analysis

The initial correlation matrix provides insights into the relationships between the target variable (Weight) and the potential predictors. The lower half of the matrix highlights the correlation coefficients, indicating the strength and direction of linear relationships. The scatter plots in the upper half allow us to visually inspect trends and potential non-linear relationships, which might suggest interactions or transformations for consideration in the regression model.

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(numerical_vars, upper.panel = panel.smooth, lower.panel = panel.cor)

Model 1: Initial Multiple Linear Regression Model

The first model includes all potential predictors. The regression summary shows the coefficients, standard errors, t-values, and p-values for each predictor. The adjusted R-squared value is high, indicating that the model explains a significant portion of the variance in the target variable. However, some predictors, such as ‘Anni.madre’ and ‘Fumatrici’, have high p-values, suggesting they are not statistically significant in the presence of other predictors.

mod1 <- lm(dati$Peso~., data = dati)
summary(mod1)
## 
## Call:
## lm(formula = dati$Peso ~ ., data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1142.56  -180.14   -14.17   160.99  2619.90 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6727.1192   141.4412 -47.561  < 2e-16 ***
## Anni.madre         0.4905     1.1567   0.424  0.67157    
## N.gravidanze1      9.0377    13.0227   0.694  0.48775    
## N.gravidanze2     49.4827    17.7445   2.789  0.00533 ** 
## N.gravidanze3     35.2155    24.9340   1.412  0.15797    
## N.gravidanze>3    63.3781    30.4873   2.079  0.03773 *  
## Fumatriciyes     -31.5914    27.5653  -1.146  0.25188    
## Gestazione        32.5875     3.8284   8.512  < 2e-16 ***
## Lunghezza         10.2919     0.3004  34.255  < 2e-16 ***
## Cranio            10.4616     0.4270  24.500  < 2e-16 ***
## Tipo.partoNat     30.0910    12.0858   2.490  0.01285 *  
## Ospedaleosp2     -10.3258    13.4419  -0.768  0.44246    
## Ospedaleosp3      28.0190    13.4970   2.076  0.03800 *  
## SessoM            77.2745    11.1739   6.916  5.9e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.8 on 2486 degrees of freedom
## Multiple R-squared:  0.7294, Adjusted R-squared:  0.728 
## F-statistic: 515.5 on 13 and 2486 DF,  p-value: < 2.2e-16

Model 2: Refined Multiple Linear Regression Model

To improve the model, ‘Anni.madre’, the least significant predictor, was removed. The updated regression summary indicates that the adjusted R-squared value remains almost unchanged, suggesting that the exclusion of ‘Anni.madre’ did not adversely affect the explanatory power of the model. Several predictors remain highly significant, including ‘Lunghezza’, ‘Cranio’, and ‘Gestazione’.

mod2 <- update(mod1, ~.-Anni.madre)
summary(mod2)
## 
## Call:
## lm(formula = dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1137.07  -180.22   -14.86   160.56  2622.51 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6710.6687   135.9945 -49.345  < 2e-16 ***
## N.gravidanze1     10.0509    12.7995   0.785  0.43238    
## N.gravidanze2     51.4583    17.1192   3.006  0.00267 ** 
## N.gravidanze3     38.0322    24.0290   1.583  0.11360    
## N.gravidanze>3    66.8931    29.3339   2.280  0.02267 *  
## Fumatriciyes     -31.7750    27.5573  -1.153  0.24900    
## Gestazione        32.4348     3.8108   8.511  < 2e-16 ***
## Lunghezza         10.2910     0.3004  34.259  < 2e-16 ***
## Cranio            10.4696     0.4265  24.546  < 2e-16 ***
## Tipo.partoNat     30.1395    12.0833   2.494  0.01268 *  
## Ospedaleosp2     -10.2173    13.4373  -0.760  0.44711    
## Ospedaleosp3      28.1662    13.4903   2.088  0.03691 *  
## SessoM            77.3253    11.1714   6.922 5.66e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.8 on 2487 degrees of freedom
## Multiple R-squared:  0.7294, Adjusted R-squared:  0.7281 
## F-statistic: 558.6 on 12 and 2487 DF,  p-value: < 2.2e-16

Comparison of Model 1 and Model 2

The ANOVA comparison between the two models reveals that the reduction in model complexity (removing ‘Anni.madre’) does not result in a statistically significant loss of explained variance, as indicated by the high p-value in the ANOVA test. Additionally, the BIC value for Model 2 is lower than that for Model 1, further supporting the selection of the refined model as it balances variance explained and model simplicity.

anova(mod2, mod1)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## Model 2: dati$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   2487 186411655                           
## 2   2486 186398173  1     13482 0.1798 0.6716
BIC(mod2, mod1)
##      df      BIC
## mod2 14 35252.78
## mod1 15 35260.43

Model 3: Refined Multiple Linear Regression Model

The variable “Ospedale” was removed from the regression model following the analysis of the significance of predictors in Model 2. “Ospedale” was the least significant variable in Model 2, as indicated by its high p-value. Removing this variable aims to streamline the model while maintaining explanatory power.

In Model 3, the adjusted R-squared value remains at 0.7274, suggesting a high proportion of variance explained by the predictors. The residual standard error shows little change, confirming that the removal of “Ospedale” did not degrade the model’s performance. Significant predictors in Model 3 include “Gestazione,” “Lunghezza,” and “Cranio” (p-value < 0.001), along with “Tipo.partoNat” and “N.gravidanze3” (p-value < 0.05). However, predictors like “N.gravidanze1” and “Fumatrici” remain non-significant.

mod3 <- update(mod2, ~.-Ospedale)
summary(mod3)
## 
## Call:
## lm(formula = dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + 
##     Lunghezza + Cranio + Tipo.parto + Sesso, data = dati)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1153.4  -180.2   -16.6   159.8  2640.6 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6710.8555   136.0317 -49.333  < 2e-16 ***
## N.gravidanze1     10.8914    12.8119   0.850  0.39535    
## N.gravidanze2     53.4958    17.1277   3.123  0.00181 ** 
## N.gravidanze3     40.5364    24.0454   1.686  0.09196 .  
## N.gravidanze>3    68.1029    29.3678   2.319  0.02048 *  
## Fumatriciyes     -33.3106    27.5890  -1.207  0.22740    
## Gestazione        32.7455     3.8146   8.584  < 2e-16 ***
## Lunghezza         10.2691     0.3007  34.155  < 2e-16 ***
## Cranio            10.4795     0.4271  24.538  < 2e-16 ***
## Tipo.partoNat     30.9233    12.0968   2.556  0.01064 *  
## SessoM            77.7897    11.1852   6.955  4.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.2 on 2489 degrees of freedom
## Multiple R-squared:  0.7284, Adjusted R-squared:  0.7274 
## F-statistic: 667.7 on 10 and 2489 DF,  p-value: < 2.2e-16

Comparison of Model 3 and Model 2

The ANOVA comparison between Model 3 and Model 2 reveals that removing “Ospedale” resulted in a statistically significant reduction in explained variance, as indicated by an F-statistic of 4.4037 with a p-value of 0.01233. However, the decrease in adjusted R-squared (from 0.7281 to 0.7274) and the slight change in RSS (from 187071813 to 186411655) suggest that the reduction in variance explained is minimal in practical terms.

Additionally, Model 3 has a slightly lower Bayesian Information Criterion (BIC) of 35245.97 compared to 35252.78 for Model 2. This lower BIC value indicates that Model 3 achieves a better balance between variance explained and model complexity. Given the minimal contribution of “Ospedale” to predictive power and its lack of statistical significance in Model 2, Model 3 is considered the more optimal and parsimonious choice.

anova(mod3, mod2)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso
## Model 2: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1   2489 187071813                              
## 2   2487 186411655  2    660158 4.4037 0.01233 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod3, mod2)
##      df      BIC
## mod3 12 35245.97
## mod2 14 35252.78

Model 4: Refined Multiple Linear Regression Model

Model 4 was derived by removing the variable “Number of Pregnancies” from Model 3, as it was the least significant predictor in the previous model, confirmed by its high p-value and lack of significance. After updating the model, we analyze the regression report to assess whether the significant predictors in Model 3 remain significant in Model 4. The results indicate that all previously significant predictors maintain their significance, while the overall adjusted R-squared slightly decreases from 0.7274 to 0.7262.

mod4 <- update(mod3, ~.-N.gravidanze)
summary(mod4)
## 
## Call:
## lm(formula = dati$Peso ~ Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso, data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1118.88  -185.33   -16.16   161.80  2622.90 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6675.9335   135.7771 -49.168  < 2e-16 ***
## Fumatriciyes    -27.5245    27.5783  -0.998   0.3184    
## Gestazione       31.4067     3.7882   8.291  < 2e-16 ***
## Lunghezza        10.2276     0.3011  33.969  < 2e-16 ***
## Cranio           10.6379     0.4242  25.075  < 2e-16 ***
## Tipo.partoNat    29.3167    12.1132   2.420   0.0156 *  
## SessoM           79.2477    11.2025   7.074 1.95e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2493 degrees of freedom
## Multiple R-squared:  0.7268, Adjusted R-squared:  0.7262 
## F-statistic:  1106 on 6 and 2493 DF,  p-value: < 2.2e-16
Comparison of Model 4 and Model 3

The ANOVA comparison between Model 4 and Model 3 shows a significant reduction in explained variance, with an F-statistic of 3.678 and a p-value of 0.005426, which is below the 5% significance level. This indicates that the removal of the variable “Number of Pregnancies” has led to a measurable reduction in the explained variance, even though the impact on the overall model performance appears minimal.

Despite the reduction in explained variance, Model 4 demonstrates a better Bayesian Information Criterion (BIC) score of 35229.41 compared to 35245.97 for Model 3. The lower BIC suggests that Model 4 is a more parsimonious choice, offering a better balance between model complexity and explanatory power, even after the removal of the variable “Number of Pregnancies.”

anova(mod4, mod3)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ Fumatrici + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
## Model 2: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Sesso
##   Res.Df       RSS Df Sum of Sq     F   Pr(>F)   
## 1   2493 188177555                               
## 2   2489 187071813  4   1105742 3.678 0.005426 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod4, mod3)
##      df      BIC
## mod4  8 35229.41
## mod3 12 35245.97

Model 5: Refined Multiple Linear Regression Model

The Model 5 was created by updating Model 4, removing the variable Fumatrici, which was found to be non-significant in the previous model. The updated model retains all previously significant predictors (Gestazione, Lunghezza, Cranio, Tipo.parto, and Sesso). All these predictors remain significant in Model 5 with p-values < 0.05. The Adjusted R-squared of the model remains almost unchanged at 0.7262, indicating that the explanatory power of the model has been maintained despite the removal of Fumatrici.

mod5 <- update(mod4, ~.-Fumatrici)
summary(mod5)
## 
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso, data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1118.43  -185.56   -16.07   161.53  2626.18 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6675.8084   135.7769 -49.167  < 2e-16 ***
## Gestazione       31.1917     3.7821   8.247 2.59e-16 ***
## Lunghezza        10.2412     0.3008  34.049  < 2e-16 ***
## Cranio           10.6397     0.4242  25.080  < 2e-16 ***
## Tipo.partoNat    29.1062    12.1113   2.403   0.0163 *  
## SessoM           79.0670    11.2010   7.059 2.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.7 on 2494 degrees of freedom
## Multiple R-squared:  0.7267, Adjusted R-squared:  0.7262 
## F-statistic:  1326 on 5 and 2494 DF,  p-value: < 2.2e-16
Comparison of Model 5 and Model 4

The ANOVA comparison between Model 5 and Model 4 shows that removing Fumatrici does not lead to a significant reduction in the explained variance. The F-statistic is 0.9961 with a p-value of 0.3184, which is greater than the 5% significance threshold. This indicates that the removal of Fumatrici does not significantly affect the predictive ability of the model.

The Bayesian Information Criterion (BIC) comparison reveals that Model 5 has a slightly lower BIC value (35222.59) compared to Model 4 (35229.41). This indicates that Model 5 is more parsimonious and provides a better balance between model complexity and variance explained. Consequently, Model 5 is preferred over Model 4.

anova(mod5, mod4)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso
## Model 2: dati$Peso ~ Fumatrici + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
##   Res.Df       RSS Df Sum of Sq      F Pr(>F)
## 1   2494 188252743                           
## 2   2493 188177555  1     75188 0.9961 0.3184
BIC(mod5, mod4)
##      df      BIC
## mod5  7 35222.59
## mod4  8 35229.41

Model 6: Refined Multiple Linear Regression Model

We updated Model 5 by removing the variable “Tipo Parto”. While “Tipo Parto” was identified as significant in the regression analysis, earlier exploratory analysis suggested no strong correlation between “Tipo Parto” and “Peso”, as evidenced by the violin and box plots and the t-test analysis. These findings motivated the removal of “Tipo Parto” to evaluate the model’s performance without this variable.

mod6 <- update(mod5, ~.-Tipo.parto)
summary(mod6)
## 
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso, 
##     data = dati)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1138.2  -184.3   -17.6   163.3  2627.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6651.1188   135.5172 -49.080  < 2e-16 ***
## Gestazione     31.2737     3.7856   8.261 2.31e-16 ***
## Lunghezza      10.2054     0.3007  33.939  < 2e-16 ***
## Cranio         10.6704     0.4245  25.139  < 2e-16 ***
## SessoM         79.1049    11.2117   7.056 2.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275 on 2495 degrees of freedom
## Multiple R-squared:  0.7261, Adjusted R-squared:  0.7257 
## F-statistic:  1654 on 4 and 2495 DF,  p-value: < 2.2e-16
Comparison of Model 6 and Model 5

The ANOVA comparison between Model 6 and Model 5 indicates a statistically significant reduction in the explained variance (p-value = 0.01632), suggesting that the removal of “Tipo Parto” results in a loss of explanatory power. However, the extent of this reduction is relatively minor, as indicated by the small F-statistic value of 5.7755.

The BIC for Model 6 is slightly lower (35220.54) compared to Model 5 (35222.59), indicating that Model 6 achieves a marginally better balance between goodness-of-fit and model complexity. Based on the BIC, Model 6 is the more parsimonious choice despite the statistically significant loss of variance explained.

anova(mod6, mod5)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso
##   Res.Df       RSS Df Sum of Sq      F  Pr(>F)  
## 1   2495 188688687                              
## 2   2494 188252743  1    435945 5.7755 0.01632 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod6, mod5)
##      df      BIC
## mod6  6 35220.54
## mod5  7 35222.59

Model 7: Refined Multiple Linear Regression Model

The quadratic term for “Lunghezza” was added to Model 6 to account for the non-linear relationship observed between “Lunghezza” and “Peso”. The summary report shows that all coefficients, including the quadratic term, are statistically significant. This suggests that the inclusion of this non-linear interaction improves the model’s explanatory power. Specifically, the quadratic term for “Lunghezza” significantly enhances the prediction of “Peso”, confirming the initial observation of a curved relationship in the scatter plot matrix.

mod7 <- update(mod6, ~.+I(Lunghezza^2))
summary(mod7)
## 
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso + 
##     I(Lunghezza^2), data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1156.75  -182.00   -12.52   166.67  1783.83 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    156.673587 724.783055   0.216    0.829    
## Gestazione      41.174410   3.860512  10.666  < 2e-16 ***
## Lunghezza      -19.913124   3.165788  -6.290 3.73e-10 ***
## Cranio          10.795854   0.417182  25.878  < 2e-16 ***
## SessoM          71.369249  11.043854   6.462 1.24e-10 ***
## I(Lunghezza^2)   0.031241   0.003269   9.555  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 270.2 on 2494 degrees of freedom
## Multiple R-squared:  0.7358, Adjusted R-squared:  0.7352 
## F-statistic:  1389 on 5 and 2494 DF,  p-value: < 2.2e-16

The ANOVA table comparing Model 7 to Model 6 reveals a substantial improvement in explained variance with the addition of the quadratic term for “Lunghezza”. The F-statistic is 91.307 with a highly significant p-value (< 2.2e-16), indicating that the added term contributes meaningfully to the model. This confirms that incorporating the non-linear relationship provides a better fit to the data.

The Bayesian Information Criterion (BIC) decreases significantly from 35220.54 in Model 6 to 35138.48 in Model 7. This reduction in BIC supports the improved balance between model complexity and explanatory power in Model 7. Hence, the quadratic term for “Lunghezza” is a valuable addition to the model, making Model 7 a better and more parsimonious choice.

anova(mod7, mod6)
## Analysis of Variance Table
## 
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso + I(Lunghezza^2)
## Model 2: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
##   Res.Df       RSS Df Sum of Sq      F    Pr(>F)    
## 1   2494 182024674                                  
## 2   2495 188688687 -1  -6664014 91.307 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod7, mod6)
##      df      BIC
## mod7  7 35138.48
## mod6  6 35220.54

Conclusion:

The final model selected is Model 7, which provides the best balance between explained variance and model complexity. While additional attempts were made to include other interactions and non-linear terms, such as Cranio², Lunghezza × Cranio, Gestazione², and Gestazione × Cranio, none of these additions significantly improved the model’s performance. Therefore, Model 7, which includes the quadratic term for “Lunghezza”, is chosen as the optimal multiple regression model.

In the next section, the assumptions of the regression model will be thoroughly verified by analyzing the residuals to ensure the validity of the model.

Analysis of Residual Assumptions

For a multiple linear regression model to be valid, its residuals must satisfy several key assumptions: 1) normal distribution, 2) mean equal to zero, 3) constant variance (homoscedasticity), 4) independence of residuals, both from each other and from the predictors. In this section, we visually assess these assumptions using diagnostic plots before performing statistical tests for confirmation.

par(mfrow = c(2,2))
plot(mod7)

Diagnostic Plot Analysis

  1. Residuals vs Fitted

This plot checks for mean zero and homoscedasticity. The residuals should be centered around zero with no discernible pattern or systematic trend. In our plot, residuals appear roughly centered around zero, but slight variance inconsistencies (heteroscedasticity) may exist, with residuals spreading more for higher fitted values.

  1. Q-Q Plot of Residuals

The Q-Q plot assesses the normality of residuals by comparing their distribution to a theoretical normal distribution. If the residuals are normal, points should align closely with the diagonal line. The plot indicates that residuals mostly follow the diagonal, with some deviations in the tails, suggesting potential slight non-normality.

  1. Scale-Location Plot

This plot checks for homoscedasticity by showing the square root of the standardized residuals against the fitted values. A horizontal line indicates constant variance. The plot reveals a fairly consistent variance but hints at potential non-constant variance for higher fitted values.

  1. Residuals vs Leverage

This plot identifies high-leverage points and influential observations using Cook’s distance. Observations like “1551” appear to have higher leverage, potentially impacting the model. These points may require further investigation to determine if they unduly influence the model’s results.

Leverage Analysis

The leverage analysis evaluates the influence of individual observations on the regression model. Observations with leverage values significantly higher than the threshold indicate that these points have a disproportionate effect on the model’s estimated coefficients. The threshold used in this analysis was calculated as 2p/n, where p is the number of predictors plus one for the intercept, and n is the total number of observations in the dataset.

lev <- hatvalues(mod7)
plot(lev)
p = sum(lev)
soglia = 2*p/N
abline(h = soglia, col = 2)

The leverage plot highlights observations with high leverage, which are data points that have a significant influence on the model due to their distance from the mean of the predictors. Observations above the red threshold line are considered to have high leverage. These points may require further investigation to assess their impact on the regression model. In this case, the plot shows that most data points have low leverage, with a few notable exceptions crossing the threshold.

Outliers Analysis

The residuals plot highlights potential outliers based on studentized residuals exceeding the thresholds of ±4, as marked by the red horizontal lines. These extreme residuals are further evaluated using the outlierTest() function, which performs a statistical test to identify significant outliers.

From the test, the observations 1551, 1306, 155, 1399, and 1694 exhibit highly significant deviations with Bonferroni-adjusted p-values below the standard significance level of 0.05. These observations warrant closer inspection as they might unduly influence the regression model’s performance.

library(car)
## Warning: package 'car' was built under R version 4.4.1
## Loading required package: carData
plot(rstudent(mod7))
abline(h=c(-4,4), col =2)

outlierTest(mod7)
##       rstudent unadjusted p-value Bonferroni p
## 1551  7.258271         5.2127e-13   1.3032e-09
## 1306  4.781064         1.8452e-06   4.6130e-03
## 155   4.664817         3.2521e-06   8.1303e-03
## 1399 -4.301603         1.7610e-05   4.4025e-02
## 1694  4.293808         1.8235e-05   4.5588e-02

Breusch-Pagan Test for Homoscedasticity

The Breusch-Pagan test evaluates the assumption of constant variance (homoscedasticity) in the residuals. The test result for Model 7 shows a test statistic of BP = 126.16 with a p-value < 2.2e-16, indicating that the null hypothesis of homoscedasticity is rejected. This suggests that the residuals exhibit heteroscedasticity, and further adjustments to the model or the use of robust standard errors might be necessary.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(mod7)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod7
## BP = 126.16, df = 5, p-value < 2.2e-16

Durbin-Watson Test for Autocorrelation

The Durbin-Watson test checks for autocorrelation in the residuals. For Model 7, the test yields DW = 1.9508 with a p-value = 0.1093. Since the p-value is greater than the standard significance level of 0.05, we fail to reject the null hypothesis, suggesting that there is no significant autocorrelation in the residuals.

dwtest(mod7)
## 
##  Durbin-Watson test
## 
## data:  mod7
## DW = 1.9508, p-value = 0.1093
## alternative hypothesis: true autocorrelation is greater than 0

Shapiro-Wilk Test for Normality

The Shapiro-Wilk test assesses the normality of the residuals. For Model 7, the test provides a test statistic of W = 0.98552 with a p-value = 2.86e-15. The extremely low p-value indicates a rejection of the null hypothesis of normality, implying that the residuals deviate significantly from a normal distribution. Further investigation or transformations might be necessary to address this deviation.

shapiro.test(residuals(mod7))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(mod7)
## W = 0.98552, p-value = 2.86e-15

Conclusion and Recommendations for Model Improvement

Based on the analyses conducted, Model 7 has been selected as the best model for prediction due to its balance between explained variance and model complexity. However, the diagnostic tests on the residuals revealed some issues:

To further improve the model, the following actions could be considered:

Despite these limitations, Model 7 provides a solid foundation for prediction. Adjustments to account for identified issues could enhance its robustness and reliability.

Application of Model 7 for Weight Prediction

The finalized regression model (Model 7) was employed to predict the weight of a newborn based on specific maternal and neonatal characteristics. For this example, the prediction was made for a male newborn (Sesso = “M”) delivered at the 39th week of gestation, with an estimated length of 500 mm and a cranial circumference of 320 mm. This highlights the practical use of the model in estimating outcomes under specified conditions, yielding a predicted weight of approximately 3142 grams.

new_data <- data.frame(
  Gestazione = 39, 
  Lunghezza = 500,
  Cranio = 320,              
  Sesso = "M" )

weight_prediction <- predict(mod7, newdata = new_data)
weight_prediction
##        1 
## 3142.127