In the initial phase of the analysis, the dataset containing information about neonates was loaded and explored to understand its structure, size, and the characteristics of its variables.
The dataset comprises 2500 observations and 10 variables, each providing important information about neonatal characteristics, maternal factors, and delivery details. The variables are a mix of numerical and categorical data, offering a comprehensive view for analysis.
dati <- read.csv("/Users/matteocalzametre/Desktop/neonati.csv")
dim(dati)
## [1] 2500 10
names(dati)
## [1] "Anni.madre" "N.gravidanze" "Fumatrici" "Gestazione" "Peso"
## [6] "Lunghezza" "Cranio" "Tipo.parto" "Ospedale" "Sesso"
head(dati)
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio Tipo.parto
## 1 26 0 0 42 3380 490 325 Nat
## 2 21 2 0 39 3150 490 345 Nat
## 3 34 3 0 38 3640 500 375 Nat
## 4 28 1 0 41 3690 515 365 Nat
## 5 20 0 0 38 3700 480 335 Nat
## 6 32 0 0 40 3200 495 340 Nat
## Ospedale Sesso
## 1 osp3 M
## 2 osp1 F
## 3 osp2 M
## 4 osp2 M
## 5 osp3 F
## 6 osp2 F
str(dati)
## 'data.frame': 2500 obs. of 10 variables:
## $ Anni.madre : int 26 21 34 28 20 32 26 25 22 23 ...
## $ N.gravidanze: int 0 2 3 1 0 0 1 0 1 0 ...
## $ Fumatrici : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gestazione : int 42 39 38 41 38 40 39 40 40 41 ...
## $ Peso : int 3380 3150 3640 3690 3700 3200 3100 3580 3670 3700 ...
## $ Lunghezza : int 490 490 500 515 480 495 480 510 500 510 ...
## $ Cranio : int 325 345 375 365 335 340 345 349 335 362 ...
## $ Tipo.parto : chr "Nat" "Nat" "Nat" "Nat" ...
## $ Ospedale : chr "osp3" "osp1" "osp2" "osp2" ...
## $ Sesso : chr "M" "F" "M" "M" ...
sum(is.na(dati))
## [1] 0
Below is a description of the key variables, along with their types:
Anni.madre): Quantitative
continuous variable representing the mother’s age in years.
N.gravidanze):
Quantitative discrete variable indicating the total number of
pregnancies experienced by the mother.
Fumatrici): Categorical
binary variable indicating whether the mother smoked during pregnancy
(yes or no).
Gestazione):
Quantitative discrete variable capturing the length of pregnancy in
weeks.
Peso): Quantitative
continuous variable representing the neonate’s weight in grams (target
variable).
Lunghezza): Quantitative
continuous variable measuring the length of the neonate in millimeters.
Cranio): Quantitative
continuous variable representing the neonate’s head circumference in
millimeters.
Tipo.parto): Categorical
nominal variable specifying whether the delivery was natural or
cesarean.
Ospedale): Categorical nominal
variable identifying the hospital where the delivery took place.
Sesso): Categorical
binary variable indicating whether the neonate is male (M) or female
(F).
The dataset provides a detailed view of maternal factors, neonatal measurements, and delivery information, which are crucial for understanding birth outcomes. The variables offer diverse perspectives, with numerical variables supporting quantitative analysis and categorical variables enriching the dataset with qualitative information.
The target variable is neonate’s weight
(Peso), which will be the focus of our predictive
modeling efforts later in the analysis.
In this section, we conducted a detailed analysis of key summary statistics for all numerical variables in the dataset. This includes measures of central tendency (mean and median), variability (standard deviation, interquartile range, and percentiles), and shape (skewness and kurtosis). These metrics provide critical insights into the characteristics of the data distribution.
library(moments)
library(knitr)
summary_stats <- sapply(dati[, c("Anni.madre","N.gravidanze","Gestazione","Peso", "Lunghezza", "Cranio")], function(x){
c(Mean = round(mean(x),2),
Median = round(median(x),2),
SD = round(sd(x),2),
IQR = round(IQR(x),2),
percentil = round(quantile(x),2),
Skewness = round(skewness(x),2),
Kurtosis = round(kurtosis(x)-3))
})
kable(as.data.frame(t(summary_stats)), caption = "Summary Stats")
| Mean | Median | SD | IQR | percentil.0% | percentil.25% | percentil.50% | percentil.75% | percentil.100% | Skewness | Kurtosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Anni.madre | 28.16 | 28 | 5.27 | 7 | 0 | 25 | 28 | 32 | 46 | 0.04 | 0 |
| N.gravidanze | 0.98 | 1 | 1.28 | 1 | 0 | 0 | 1 | 1 | 12 | 2.51 | 11 |
| Gestazione | 38.98 | 39 | 1.87 | 2 | 25 | 38 | 39 | 40 | 43 | -2.07 | 8 |
| Peso | 3284.08 | 3300 | 525.04 | 630 | 830 | 2990 | 3300 | 3620 | 4930 | -0.65 | 2 |
| Lunghezza | 494.69 | 500 | 26.32 | 30 | 310 | 480 | 500 | 510 | 565 | -1.51 | 6 |
| Cranio | 340.03 | 340 | 16.43 | 20 | 235 | 330 | 340 | 350 | 390 | -0.79 | 3 |
These findings help provide an initial understanding of the numerical variables’ distributions. Further graphical exploration will confirm and visualize these observations, especially for key variables like neonate weight and length.
The following section provides a frequency distribution analysis of categorical variables in the dataset. For each variable, we present the absolute and relative frequencies, and in some cases, cumulative frequencies. These distributions help us understand the structure and proportion of the data for qualitative attributes.
Tipo.parto)
The variable “Tipo.parto” categorizes deliveries as either natural
(Nat) or cesarean (Ces). Below is the
frequency distribution:
Observation: We note a clear prevalence of natural deliveries over cesarean deliveries.
N <- dim(dati)[1]
freq_ass <- table(dati$Tipo.parto)
freq_rel <- round(table(dati$Tipo.parto)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
## freq_ass freq_rel
## Ces 728 0.29
## Nat 1772 0.71
Fumatrici)
The variable “Fumatrici” captures whether the mother smoked during pregnancy. The data was recoded to “yes” for smokers and “no” for non-smokers. The results are as follows:
Observation: The dataset shows an overwhelming prevalence of non-smoking mothers (96%), with only 4% smokers.
dati$Fumatrici <- ifelse(dati$Fumatrici == 1, "yes", "no")
freq_ass <- table(dati$Fumatrici)
freq_rel <- round(table(dati$Fumatrici)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
## freq_ass freq_rel
## no 2396 0.96
## yes 104 0.04
N.gravidanze)
This variable captures the number of pregnancies. An initial analysis showed a wide range of values, which were later grouped into intervals:
freq_ass <- table(dati$N.gravidanze)
freq_rel <- round(table(dati$N.gravidanze)/N,2)
freq_ass_cum <- cumsum(freq_ass)
freq_rel_cum <- cumsum(freq_rel)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel, freq_ass_cum, freq_rel_cum))
distr_freq
## freq_ass freq_rel freq_ass_cum freq_rel_cum
## 0 1096 0.44 1096 0.44
## 1 818 0.33 1914 0.77
## 2 340 0.14 2254 0.91
## 3 150 0.06 2404 0.97
## 4 48 0.02 2452 0.99
## 5 21 0.01 2473 1.00
## 6 11 0.00 2484 1.00
## 7 1 0.00 2485 1.00
## 8 8 0.00 2493 1.00
## 9 2 0.00 2495 1.00
## 10 3 0.00 2498 1.00
## 11 1 0.00 2499 1.00
## 12 1 0.00 2500 1.00
Observation: To facilitate interpretation, the variable was grouped into 5 classes based on frequency distribution, highlighting that most mothers had 0 or 1 pregnancy.
dati$N.gravidanze <- cut(dati$N.gravidanze,
breaks = c(-Inf,0,1,2,3,Inf),
labels = c("0","1","2","3",">3"))
freq_ass <- table(dati$N.gravidanze)
freq_rel <- round(table(dati$N.gravidanze)/N,2)
freq_ass_cum <- cumsum(freq_ass)
freq_rel_cum <- cumsum(freq_rel)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel, freq_ass_cum, freq_rel_cum))
distr_freq
## freq_ass freq_rel freq_ass_cum freq_rel_cum
## 0 1096 0.44 1096 0.44
## 1 818 0.33 1914 0.77
## 2 340 0.14 2254 0.91
## 3 150 0.06 2404 0.97
## >3 96 0.04 2500 1.01
Ospedale)
The dataset includes deliveries from three hospitals (osp1,
osp2, osp3). The distribution of deliveries is
as follows:
Observation: The deliveries are evenly distributed across the three hospitals, showing no significant imbalance.
freq_ass <- table(dati$Ospedale)
freq_rel <- round(table(dati$Ospedale)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
## freq_ass freq_rel
## osp1 816 0.33
## osp2 849 0.34
## osp3 835 0.33
Sesso)
The gender of neonates is categorized as male (M) or female
(F). The distribution is nearly equal:
Observation: The gender distribution is perfectly balanced, with an equal proportion of males and females.
freq_ass <- table(dati$Sesso)
freq_rel <- round(table(dati$Sesso)/N,2)
distr_freq <- as.data.frame(cbind(freq_ass, freq_rel))
distr_freq
## freq_ass freq_rel
## F 1256 0.5
## M 1244 0.5
The summary statistics for the “Weight” variable reveal a median value of 3300 grams and a mean of 3284.08 grams, with a standard deviation of 525.04 grams. The interquartile range (IQR) is 630 grams, indicating moderate variability in the central 50% of the data. The skewness of -0.65 suggests a slight negative asymmetry, indicating a higher frequency of heavier weights and a tail skewed toward lighter weights. The kurtosis of 2.03 shows a leptokurtic distribution, with more pronounced peaks and tails compared to a normal distribution.
These observations highlight that while the data is relatively symmetric around the mean and median, there are heavier weights with fewer extremely low weights, as evidenced by the negative skewness. The leptokurtic nature also suggests some concentration around the mean, with a few more extreme values influencing the distribution.
summary(dati$Peso)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 830 2990 3300 3284 3620 4930
round(sd(dati$Peso),2)
## [1] 525.04
round(skewness(dati$Peso),2)
## [1] -0.65
round(kurtosis(dati$Peso)-3,2)
## [1] 2.03
The Shapiro-Wilk Normality Test was applied to assess whether the distribution of birth weight follows a normal distribution. The null hypothesis (H0) of the test assumes that the data are normally distributed. The resulting p-value was significantly lower than the significance level of 0.05, leading us to reject the null hypothesis and conclude that the variable does not follow a normal distribution.
shapiro.test(dati$Peso)
##
## Shapiro-Wilk normality test
##
## data: dati$Peso
## W = 0.97066, p-value < 2.2e-16
To complement the statistical indices, we performed a graphical analysis of the birth weight variable using three visualizations:
library(ggplot2)
library(gridExtra)
p1 <- ggplot(dati, aes(x = Peso)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
geom_density(color = "blue", linewidth = 1) +
ggtitle("Histogram with Density Curve") +
xlab("Weigth")+
theme_minimal()
p2 <- ggplot(dati, aes(sample = Peso)) +
stat_qq(color = "blue") +
stat_qq_line(color = "red") +
ggtitle("QQ Plot of Neonatal Weight") +
theme_minimal()
p3 <- ggplot(dati, aes(x = "", y = Peso)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
ggtitle("Boxplot of Neonatal Weight") +
ylab("Weight (grams)") +
theme_minimal()
grid.arrange(p1,p2,p3, ncol=3)
The combination of statistical indices, normality testing, and graphical analysis provides a comprehensive view of the birth weight variable. The results indicate that the distribution is slightly negatively skewed with a high kurtosis, and it significantly deviates from normality. These findings will be critical in further modeling and analysis steps, as they guide the choice of appropriate statistical methods to handle the observed deviations.
The summary statistics for the “Length” variable indicate a median value of 500 mm and a mean of 494.7 mm, with a standard deviation of 26.32 mm. The interquartile range (IQR) is 30 mm, reflecting the spread of the middle 50% of the data. The skewness of -1.51 reveals a negative asymmetry, indicating a higher frequency of larger lengths and a tail skewed toward smaller lengths. The kurtosis of 6.49 highlights a leptokurtic distribution, suggesting more pronounced peaks and tails compared to a normal distribution.
summary(dati$Lunghezza)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 310.0 480.0 500.0 494.7 510.0 565.0
round(sd(dati$Lunghezza),2)
## [1] 26.32
round(skewness(dati$Lunghezza),2)
## [1] -1.51
round(kurtosis(dati$Lunghezza)-3,2)
## [1] 6.49
library(ggplot2)
library(gridExtra)
p1 <- ggplot(dati, aes(x = Lunghezza)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
geom_density(color = "blue", linewidth = 1) +
ggtitle("Histogram with Density Curve") +
xlab("Length")+
theme_minimal()
p2 <- ggplot(dati, aes(sample = Lunghezza)) +
stat_qq(color = "blue") +
stat_qq_line(color = "red") +
ggtitle("QQ Plot of Neonatal Length") +
theme_minimal()
p3 <- ggplot(dati, aes(x = "", y = Lunghezza)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
ggtitle("Boxplot of Neonatal Length") +
ylab("Length (mm)") +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 3)
The graphical analysis supports these findings:
Overall, the “Length” variable demonstrates a negatively skewed and leptokurtic distribution, with most values centered around the median but with some extreme lower values influencing the distribution.
During the preliminary analysis, we identified anomalous values in the
Anni.madre variable, specifically two records indicating
ages of 0 and 1 year. These values are implausible for maternal age and
likely resulted from data entry errors. Rather than removing these
records and potentially losing valuable information about other
variables, we opted to address the anomalies through data imputation.
The median value of the variable Anni.madre, which is
robust to the influence of outliers, was used to replace these
implausible values. This choice ensures that the overall distribution
and central tendency of the data remain representative while addressing
inaccuracies.
After imputing the anomalous values, we calculated key descriptive
statistics for Anni.madre:
dati$Anni.madre[dati$Anni.madre == 0 | dati$Anni.madre == 1] <- median(dati$Anni.madre)
summary(dati$Anni.madre)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 25.00 28.00 28.19 32.00 46.00
round(sd(dati$Anni.madre),2)
## [1] 5.22
round(skewness(dati$Anni.madre),2)
## [1] 0.15
round(kurtosis(dati$Anni.madre)-3,2)
## [1] -0.1
p1 <- ggplot(dati, aes(x = Anni.madre)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
geom_density(color = "blue", linewidth = 1) +
ggtitle("Histogram with Density Curve") +
xlab("Maternal Age (year)")+
theme_minimal()
p2 <- ggplot(dati, aes(x = "", y = Anni.madre)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
ggtitle("Boxplot of Maternal Age") +
ylab("Maternal Age (year)") +
theme_minimal()
grid.arrange(p1, p2, ncol = 2)
We further examined the distribution of maternal age using:
To explore the distribution of the gestational weeks variable, we first calculated summary statistics including measures of central tendency, variability, and shape. The results are summarized below:
summary(dati$Gestazione)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 38.00 39.00 38.98 40.00 43.00
round(sd(dati$Gestazione),2)
## [1] 1.87
round(skewness(dati$Gestazione),2)
## [1] -2.07
round(kurtosis(dati$Gestazione)-3,2)
## [1] 8.26
ggplot(dati, aes(x = Gestazione)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1, color = "black", fill = "lightblue") +
geom_density(color = "blue", linewidth = 1) +
ggtitle("Histogram with Density Curve") +
xlab("Gestation (weeks)")+
theme_minimal()
The histogram with a density curve reveals a negatively skewed distribution, consistent with the calculated skewness. The data shows a concentration of gestational weeks near the 40-week mark, with a notable peak. The presence of sharp peaks suggests some common gestation durations, likely reflective of standard medical practices or recording conventions.
Overall, the analysis highlights the distribution’s asymmetry and peakedness, which could warrant further investigation to understand medical or procedural influences on the data.
The analysis of the cranial circumference variable highlights the following statistical properties:
summary(dati$Cranio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 235 330 340 340 350 390
round(sd(dati$Cranio),2)
## [1] 16.43
round(skewness(dati$Cranio),2)
## [1] -0.79
round(kurtosis(dati$Cranio)-3,2)
## [1] 2.95
The following plots provide a visual analysis of the cranial circumference distribution:
p1 <- ggplot(dati, aes(x = Cranio)) +
geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightblue") +
geom_density(color = "blue", linewidth = 1) +
ggtitle("Histogram with Density Curve") +
xlab("Cranial Circumference (mm)")+
theme_minimal()
p2 <- ggplot(dati, aes(sample = Cranio)) +
stat_qq(color = "blue") +
stat_qq_line(color = "red") +
ggtitle("QQ Plot of Cranial Circumference") +
theme_minimal()
p3 <- ggplot(dati, aes(x = "", y = Cranio)) +
geom_boxplot(fill = "lightblue", color = "black", outlier.color = "red", outlier.size = 2) +
ggtitle("Boxplot of Cranial Circumference") +
ylab("Cranial Circumference (mm)") +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 3)
Overall, the analysis of cranial circumference reveals a moderately variable distribution with a slight left skew and a peaked profile.
The graphical analysis of qualitative variables is performed to understand the distribution of categories within each variable. Bar plots were used to visualize the frequencies of each category, providing an intuitive representation of the data.
p1 <- ggplot(dati, aes(x = N.gravidanze)) +
geom_bar(fill = "lightblue", color = "black") +
ggtitle("Distribution of Number of Pregnancies") +
xlab("Number of Pregnancies") +
ylab("Frequency") +
theme_minimal()
p2 <- ggplot(dati, aes(x = Fumatrici)) +
geom_bar(fill = "lightgreen", color = "black") +
ggtitle("Distribution of Smokers") +
xlab("Smokers (yes/no)") +
ylab("Frequency") +
theme_minimal()
p3 <- ggplot(dati, aes(x = Tipo.parto)) +
geom_bar(fill = "lightcoral", color = "black") +
ggtitle("Distribution of Delivery Type") +
xlab("Delivery Type (Ces/Nat)") +
ylab("Frequency") +
theme_minimal()
p4 <- ggplot(dati, aes(x = Ospedale)) +
geom_bar(fill = "lightpink", color = "black") +
ggtitle("Distribution of Hospital") +
xlab("Hospital") +
ylab("Frequency") +
theme_minimal()
p5 <- ggplot(dati, aes(x = Sesso)) +
geom_bar(fill = "lightcyan", color = "black") +
ggtitle("Distribution of Gender") +
xlab("Gender (F/M)") +
ylab("Frequency") +
theme_minimal()
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
From the graphical analysis, we observe the following key findings:
These findings provide a comprehensive overview of the dataset’s categorical variables and validate the consistency of the data, serving as a strong foundation for further analyses.
In this section, we analyze the correlation between the numerical variables in the dataset. A correlation matrix was computed to identify potential linear relationships between variables. Visualizations, such as the correlation plot, were used to enhance the interpretation of these relationships. Particular attention was given to correlations with the target variable, Peso (weight), to explore potential predictors or influential variables.
library(ggcorrplot)
numerical_vars <- dati[, c("Peso", "Lunghezza", "Cranio", "Anni.madre", "Gestazione")]
cor_matrix <- as.data.frame(round(cor(numerical_vars),2))
cor_matrix
## Peso Lunghezza Cranio Anni.madre Gestazione
## Peso 1.00 0.80 0.70 -0.02 0.59
## Lunghezza 0.80 1.00 0.60 -0.06 0.62
## Cranio 0.70 0.60 1.00 0.02 0.46
## Anni.madre -0.02 -0.06 0.02 1.00 -0.13
## Gestazione 0.59 0.62 0.46 -0.13 1.00
Overall, the variables Lunghezza, Cranio, and Gestazione emerge as key factors potentially influencing the target variable Peso. These insights will be crucial for further modeling and prediction tasks.
ggcorrplot(cor_matrix,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("blue", "white", "red"),
title = "Correlation Matrix",
ggtheme = theme_minimal())
The computed correlation coefficient between Weight and Length was 0.796, indicating a strong positive linear relationship.
cor(dati$Peso, dati$Lunghezza)
## [1] 0.7960368
A scatter plot was generated to visually explore the correlation between Peso and Lunghezza. The points display a clear upward trend, further confirming the positive association between these variables.
ggplot(dati, aes(x = Lunghezza, y = Peso)) +
geom_point(color = "black") +
ggtitle("Scatter Plot between Weight and Length") +
xlab("Length (mm)") +
ylab("Weight (grams)") +
theme_minimal()
A Pearson correlation test was conducted to assess the statistical significance of the observed correlation. The results (p-value < 2.2e-16) strongly reject the null hypothesis, suggesting that the correlation is highly significant.
cor_test <- cor.test(dati$Lunghezza, dati$Peso)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Lunghezza and dati$Peso
## t = 65.735, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7812132 0.8099631
## sample estimates:
## cor
## 0.7960368
The computed Pearson’s correlation coefficient between Peso and Cranio is 0.704, indicating a strong positive linear relationship. This suggests that as the cranial diameter increases, the birth weight also tends to increase.
cor(dati$Peso, dati$Cranio)
## [1] 0.7048015
The scatter plot between Peso and Cranio shows a clear upward trend. This visualization supports the observed positive association, further reinforcing the linear relationship suggested by the correlation coefficient.
ggplot(dati, aes(x = Cranio, y = Peso))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Wight and Cranial Circumference")+
xlab("Cranial Circumference (mm)")+
ylab("Weight (grams)")+
theme_minimal()
A Pearson correlation test was performed to assess the statistical significance of the correlation between Weight and Cranial Circumference. The test results (p-value < 2.2e-16) indicate a highly significant correlation, confirming that the observed relationship is unlikely to be due to chance.
cor_test <- cor.test(dati$Peso, dati$Cranio)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Peso and dati$Cranio
## t = 49.656, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6845119 0.7240000
## sample estimates:
## cor
## 0.7048015
The Pearson’s correlation coefficient between Weight and Gestation is 0.592, representing a moderately strong positive linear relationship. This implies that longer gestational periods are generally associated with higher birth weights.
cor(dati$Peso, dati$Gestazione)
## [1] 0.5917687
A scatter plot of Peso versus Gestazione illustrates the positive relationship, with an upward trend indicating that higher gestational weeks contribute to greater birth weights.
ggplot(dati, aes(x = Gestazione, y = Peso))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Weight and Gestation")+
xlab("Gestation (weeks)")+
ylab("Weight (grams)")+
theme_minimal()
A Pearson correlation test confirmed the statistical significance of the observed relationship between Weight and Gestation. The highly significant p-value (< 2.2e-16) rejects the null hypothesis, affirming the positive correlation between these variables.
cor_test <- cor.test(dati$Peso, dati$Gestazione)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Peso and dati$Gestazione
## t = 36.691, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5656894 0.6166655
## sample estimates:
## cor
## 0.5917687
The computed Pearson correlation coefficient between Length and Cranial Circumference is 0.603, suggesting a moderate positive relationship between these two variables.
cor(dati$Lunghezza, dati$Cranio)
## [1] 0.603341
The scatter plot for Length and Cranial Circumference demonstrates a clear upward trend, confirming the positive association between these variables.
ggplot(dati, aes(x = Lunghezza, y = Cranio))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Weight and Cranial Circumference")+
xlab("Lenght (mm)")+
ylab("Cranial Circumference (mm)")+
theme_minimal()
A Pearson correlation test confirms the statistical significance of the correlation between Lenght and Cranial Circumference. The p-value is < 2.2e-16, strongly rejecting the null hypothesis, indicating a meaningful association.
cor_test <- cor.test(dati$Lunghezza, dati$Cranio)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Lunghezza and dati$Cranio
## t = 37.813, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5778048 0.6276970
## sample estimates:
## cor
## 0.603341
The correlation coefficient between Length and Gestation is 0.619, which represents a moderate positive linear relationship.
cor(dati$Lunghezza, dati$Gestazione)
## [1] 0.6189204
The scatter plot for Length and Gestation illustrates a positive trend, supporting the observed correlation between the two variables.
ggplot(dati, aes(x = Lunghezza, y = Gestazione))+
geom_point(color = "black")+
ggtitle("Scatter between Lenght and Gestation")+
xlab("Length (mm)")+
ylab("Gestation (weeks)")+
theme_minimal()
The significance test for the correlation between Length and Gestation yields a p-value of < 2.2e-16, providing strong evidence for the significance of the relationship.
cor_test <- cor.test(dati$Lunghezza, dati$Gestazione)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Lunghezza and dati$Gestazione
## t = 39.383, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5941334 0.6425332
## sample estimates:
## cor
## 0.6189204
The correlation coefficient between Cranial Circumference and Gestation was calculated to be 0.4608, indicating a moderate positive correlation.
cor(dati$Cranio, dati$Gestazione)
## [1] 0.4608289
Visualization of the Relationship: The scatter plot revealed a generally upward trend, supporting the moderate positive relationship between the two variables.
ggplot(dati, aes(x= Cranio, y = Gestazione))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Cranial Circumference and Gestation")+
xlab("Cranial Circumference (mm)")+
ylab("Gestation (weeks)")+
theme_minimal()
Significance Testing: The Pearson correlation test results showed a p-value < 2.2e-16, suggesting the correlation is statistically significant.
cor_test <- cor.test(dati$Cranio, dati$Gestazione)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Cranio and dati$Gestazione
## t = 25.952, df = 2498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4293834 0.4911585
## sample estimates:
## cor
## 0.4608289
Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Gestation was -0.1349, indicating a weak negative correlation.
cor(dati$Anni.madre, dati$Gestazione)
## [1] -0.1349262
Visualization of the Relationship: The scatter plot did not show a pronounced trend, consistent with the weak negative correlation observed.
ggplot(dati, aes(x = Gestazione, y = Anni.madre))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Maternal Age and Gestation")+
xlab("Gestation (weeks)")+
ylab("Maternal Age")+
theme_minimal()
Significance Testing: The Pearson correlation test yielded a p-value of 1.253e-11, which indicates the weak correlation is statistically significant.
cor_test <- cor.test(dati$Anni.madre, dati$Gestazione)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Anni.madre and dati$Gestazione
## t = -6.8058, df = 2498, p-value = 1.253e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17321275 -0.09623254
## sample estimates:
## cor
## -0.1349262
Calculation of Pearson’s Correlation Coefficient: The correlation coefficient between Maternal Age and Cranial Circumference was 0.0162. This suggests no meaningful correlation between the two variables.
cor(dati$Anni.madre, dati$Cranio)
## [1] 0.01620269
Visualization of the Relationship: The scatter plot showed a random distribution of points, reflecting the negligible correlation.
ggplot(dati, aes(x = Cranio, y = Anni.madre))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Maternal Age and Cranial Circumference")+
xlab("Cranial Circumference (mm)")+
ylab("Maternal Age")+
theme_minimal()
Significance Testing: The correlation test had a p-value of 0.4181, indicating the lack
cor_test <- cor.test(dati$Anni.madre, dati$Cranio)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Anni.madre and dati$Cranio
## t = 0.80992, df = 2498, p-value = 0.4181
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02301464 0.05537024
## sample estimates:
## cor
## 0.01620269
Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Weight was -0.0238, pointing to a very weak negative correlation.
cor(dati$Anni.madre, dati$Peso)
## [1] -0.02377346
Visualization of the Relationship: The scatter plot displayed no clear trend, aligning with the weak correlation.
ggplot(dati, aes(x = Peso, y = Anni.madre))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Maternal Age and Weight")+
xlab("Weight (grams)")+
ylab("Maternal Age")+
theme_minimal()
Significance Testing: The p-value of 0.2347 indicates the correlation is not statistically significant.
cor_test <- cor.test(dati$Anni.madre, dati$Peso)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Anni.madre and dati$Peso
## t = -1.1885, df = 2498, p-value = 0.2347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06291754 0.01544365
## sample estimates:
## cor
## -0.02377346
Calculation of Pearson’s Correlation Coefficient: The correlation coefficient for Maternal Age and Length was -0.0649, indicating a weak negative correlation.
cor(dati$Anni.madre, dati$Lunghezza)
## [1] -0.06495563
Visualization of the Relationship: The scatter plot showed a slight downward trend, consistent with the weak negative correlation.
ggplot(dati, aes(x = Lunghezza, y = Anni.madre))+
geom_point(color = "black")+
ggtitle("Scatter Plot between Maternal Age and Length")+
xlab("Length (mm)")+
ylab("Maternal Age")+
theme_minimal()
Significance Testing: The Pearson correlation test produced a p-value of 0.001156, indicating that the weak correlation is statistically significant.
cor_test <- cor.test(dati$Anni.madre, dati$Lunghezza)
cor_test
##
## Pearson's product-moment correlation
##
## data: dati$Anni.madre and dati$Lunghezza
## t = -3.2534, df = 2498, p-value = 0.001156
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10389379 -0.02581865
## sample estimates:
## cor
## -0.06495563
Correlation Analysis between Target Variable and Qualitative Variables/h3>
The violin plot presented above illustrates the distribution of neonate weights across the three hospitals: osp1, osp2, and osp3. The plot combines density estimation with boxplots, allowing for a clear visualization of the central tendency and spread of weights within each hospital. Each violin shape indicates the distribution of weights, while the embedded boxplot highlights the median and interquartile range. Observing the plot, it seems that the distributions of weights are relatively similar across the hospitals, with minor differences in the spread and density.
ggplot(dati, aes(x = Ospedale, y = Peso, fill = Ospedale)) +
geom_violin(trim = FALSE, alpha = 0.6) +
geom_boxplot(width = 0.1, color = "black", outlier.shape = NA) +
labs(title = "Distribution of Weight by Hospital",
x = "Hospital", y = "Weight (grams)") +
theme_minimal() +
theme(legend.position = "none")
To statistically verify whether there are significant differences in the mean weights of neonates across the three hospitals, an ANOVA test was conducted. The summary of the ANOVA results is shown above. The F-statistic value is 1.699, and the associated p-value is 0.183. Since the p-value is greater than the conventional significance level (e.g., 0.05), we fail to reject the null hypothesis. This indicates that there is no statistically significant difference in the mean weights of neonates among the three hospitals.
anova_result <- aov(Peso ~ Ospedale, data = dati)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Ospedale 2 936237 468118 1.699 0.183
## Residuals 2497 687952305 275512
The violin plot visualizes the distribution of neonatal weight (in grams) across different categories of the number of pregnancies.The plot shows a consistent central tendency across categories, with no notable outliers or significant differences in shape or spread of the distributions. This suggests that the number of pregnancies does not strongly influence neonatal weight in this dataset.
ggplot(dati, aes(x=N.gravidanze,y=Peso, fill = N.gravidanze))+
geom_violin(trim = FALSE, alpha=0.6) +
geom_boxplot(width = 0.1, color = "black", outlier.shape=NA)+
labs(title = "Distribution of Weight by Number of Pregnancies",
x = "Number of Pregnancies", y = "Weight (grams)")+
theme_minimal()+
theme(legend.position = "none")
The ANOVA test was conducted to statistically evaluate if the mean neonatal weights differ significantly across the categories of the number of pregnancies. The results indicate a p-value of 0.755, which is well above the standard significance level of 0.05. This suggests that we fail to reject the null hypothesis, and there is no statistical evidence of significant differences in the mean weights between the groups defined by the number of pregnancies.
anova_result <- aov(Peso ~ N.gravidanze, data = dati)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## N.gravidanze 4 522163 130541 0.473 0.755
## Residuals 2495 688366379 275898
The violin plot provides a detailed visualization of the distribution of neonatal weights for the two groups: mothers who smoke and mothers who do not smoke. The plot reveals slight differences in weight distribution between the groups, with non-smoking mothers generally having neonates with higher weights. However, the overlapping distributions suggest the differences may not be statistically significant.
ggplot(dati, aes(x=Fumatrici, y=Peso, fill=Fumatrici))+
geom_violin(trim = FALSE, alpha=0.6)+
geom_boxplot(width = 0.1, color = "black", outlier.shape=NA)+
labs(title = "Distribution of Weight by Smokers",
x = "Smokers (yes/no)", y = "Weight (grams)")+
theme_minimal()+
theme(legend.position = "none")
The Welch two-sample t-test was performed to assess whether the mean neonatal weight significantly differs between the two groups (smokers and non-smokers). The test yielded a t-value of 1.034 with a p-value of 0.3033. This result indicates that there is no statistically significant difference in neonatal weight between the two groups at a 5% significance level.
t.test(Peso~Fumatrici, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -45.61354 145.22674
## sample estimates:
## mean in group no mean in group yes
## 3286.153 3236.346
The violin plot above displays the distribution of neonatal weight according to the type of delivery (Cesarean or Natural). The distributions of weight for both groups are visually similar, with overlapping ranges and comparable medians.
ggplot(dati, aes(x=Tipo.parto, y=Peso, fill = Tipo.parto))+
geom_violin(trim = FALSE, alpha = 0.6)+
geom_boxplot(width=0.1, color = "black", outlier.shape=NA)+
labs(title = "Distribution of Weight by Deliverry Type",
x = "Delivery Type", y = "Weight (grams)")+
theme_minimal()+
theme(legend.position = "none")
With a p-value of 0.8968, the null hypothesis cannot be rejected. This suggests that there is no statistically significant difference in the mean neonatal weight between Cesarean and Natural delivery groups.
t.test(Peso~Tipo.parto, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Tipo.parto
## t = -0.12968, df = 1493, p-value = 0.8968
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
## -46.27992 40.54037
## sample estimates:
## mean in group Ces mean in group Nat
## 3282.047 3284.916
The violin plot shows the weight distributions for males and females. Male neonates exhibit a higher median weight and a wider spread compared to females.
ggplot(dati, aes(x=Sesso, y=Peso, fill = Sesso))+
geom_violin(trim = FALSE, alpha = 0.6)+
geom_boxplot(width = 0.1, color = "black", outlier.shape = NA)+
labs(title="Distribution of Weight by Gender",
x = "Gender (M/F)", y = "Weight (grams)")+
theme_minimal()+
theme(legend.position="none")
The Welch’s t-test indicates a significant difference in mean weights (p < 0.001). Males have a higher mean weight (3408.215g) compared to females (3161.132g), confirmed by a confidence interval of (-287.1051, -207.0615).
t.test(Peso~Sesso, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M
## 3161.132 3408.215
The initial correlation matrix provides insights into the relationships between the target variable (Weight) and the potential predictors. The lower half of the matrix highlights the correlation coefficients, indicating the strength and direction of linear relationships. The scatter plots in the upper half allow us to visually inspect trends and potential non-linear relationships, which might suggest interactions or transformations for consideration in the regression model.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(numerical_vars, upper.panel = panel.smooth, lower.panel = panel.cor)
The first model includes all potential predictors. The regression summary shows the coefficients, standard errors, t-values, and p-values for each predictor. The adjusted R-squared value is high, indicating that the model explains a significant portion of the variance in the target variable. However, some predictors, such as ‘Anni.madre’ and ‘Fumatrici’, have high p-values, suggesting they are not statistically significant in the presence of other predictors.
mod1 <- lm(dati$Peso~., data = dati)
summary(mod1)
##
## Call:
## lm(formula = dati$Peso ~ ., data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1142.56 -180.14 -14.17 160.99 2619.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6727.1192 141.4412 -47.561 < 2e-16 ***
## Anni.madre 0.4905 1.1567 0.424 0.67157
## N.gravidanze1 9.0377 13.0227 0.694 0.48775
## N.gravidanze2 49.4827 17.7445 2.789 0.00533 **
## N.gravidanze3 35.2155 24.9340 1.412 0.15797
## N.gravidanze>3 63.3781 30.4873 2.079 0.03773 *
## Fumatriciyes -31.5914 27.5653 -1.146 0.25188
## Gestazione 32.5875 3.8284 8.512 < 2e-16 ***
## Lunghezza 10.2919 0.3004 34.255 < 2e-16 ***
## Cranio 10.4616 0.4270 24.500 < 2e-16 ***
## Tipo.partoNat 30.0910 12.0858 2.490 0.01285 *
## Ospedaleosp2 -10.3258 13.4419 -0.768 0.44246
## Ospedaleosp3 28.0190 13.4970 2.076 0.03800 *
## SessoM 77.2745 11.1739 6.916 5.9e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.8 on 2486 degrees of freedom
## Multiple R-squared: 0.7294, Adjusted R-squared: 0.728
## F-statistic: 515.5 on 13 and 2486 DF, p-value: < 2.2e-16
To improve the model, ‘Anni.madre’, the least significant predictor, was removed. The updated regression summary indicates that the adjusted R-squared value remains almost unchanged, suggesting that the exclusion of ‘Anni.madre’ did not adversely affect the explanatory power of the model. Several predictors remain highly significant, including ‘Lunghezza’, ‘Cranio’, and ‘Gestazione’.
mod2 <- update(mod1, ~.-Anni.madre)
summary(mod2)
##
## Call:
## lm(formula = dati$Peso ~ N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso, data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1137.07 -180.22 -14.86 160.56 2622.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6710.6687 135.9945 -49.345 < 2e-16 ***
## N.gravidanze1 10.0509 12.7995 0.785 0.43238
## N.gravidanze2 51.4583 17.1192 3.006 0.00267 **
## N.gravidanze3 38.0322 24.0290 1.583 0.11360
## N.gravidanze>3 66.8931 29.3339 2.280 0.02267 *
## Fumatriciyes -31.7750 27.5573 -1.153 0.24900
## Gestazione 32.4348 3.8108 8.511 < 2e-16 ***
## Lunghezza 10.2910 0.3004 34.259 < 2e-16 ***
## Cranio 10.4696 0.4265 24.546 < 2e-16 ***
## Tipo.partoNat 30.1395 12.0833 2.494 0.01268 *
## Ospedaleosp2 -10.2173 13.4373 -0.760 0.44711
## Ospedaleosp3 28.1662 13.4903 2.088 0.03691 *
## SessoM 77.3253 11.1714 6.922 5.66e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.8 on 2487 degrees of freedom
## Multiple R-squared: 0.7294, Adjusted R-squared: 0.7281
## F-statistic: 558.6 on 12 and 2487 DF, p-value: < 2.2e-16
The ANOVA comparison between the two models reveals that the reduction in model complexity (removing ‘Anni.madre’) does not result in a statistically significant loss of explained variance, as indicated by the high p-value in the ANOVA test. Additionally, the BIC value for Model 2 is lower than that for Model 1, further supporting the selection of the refined model as it balances variance explained and model simplicity.
anova(mod2, mod1)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
## Model 2: dati$Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Ospedale + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2487 186411655
## 2 2486 186398173 1 13482 0.1798 0.6716
BIC(mod2, mod1)
## df BIC
## mod2 14 35252.78
## mod1 15 35260.43
The variable “Ospedale” was removed from the regression model following the analysis of the significance of predictors in Model 2. “Ospedale” was the least significant variable in Model 2, as indicated by its high p-value. Removing this variable aims to streamline the model while maintaining explanatory power.
In Model 3, the adjusted R-squared value remains at 0.7274, suggesting a high proportion of variance explained by the predictors. The residual standard error shows little change, confirming that the removal of “Ospedale” did not degrade the model’s performance. Significant predictors in Model 3 include “Gestazione,” “Lunghezza,” and “Cranio” (p-value < 0.001), along with “Tipo.partoNat” and “N.gravidanze3” (p-value < 0.05). However, predictors like “N.gravidanze1” and “Fumatrici” remain non-significant.
mod3 <- update(mod2, ~.-Ospedale)
summary(mod3)
##
## Call:
## lm(formula = dati$Peso ~ N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Tipo.parto + Sesso, data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1153.4 -180.2 -16.6 159.8 2640.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6710.8555 136.0317 -49.333 < 2e-16 ***
## N.gravidanze1 10.8914 12.8119 0.850 0.39535
## N.gravidanze2 53.4958 17.1277 3.123 0.00181 **
## N.gravidanze3 40.5364 24.0454 1.686 0.09196 .
## N.gravidanze>3 68.1029 29.3678 2.319 0.02048 *
## Fumatriciyes -33.3106 27.5890 -1.207 0.22740
## Gestazione 32.7455 3.8146 8.584 < 2e-16 ***
## Lunghezza 10.2691 0.3007 34.155 < 2e-16 ***
## Cranio 10.4795 0.4271 24.538 < 2e-16 ***
## Tipo.partoNat 30.9233 12.0968 2.556 0.01064 *
## SessoM 77.7897 11.1852 6.955 4.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.2 on 2489 degrees of freedom
## Multiple R-squared: 0.7284, Adjusted R-squared: 0.7274
## F-statistic: 667.7 on 10 and 2489 DF, p-value: < 2.2e-16
The ANOVA comparison between Model 3 and Model 2 reveals that removing “Ospedale” resulted in a statistically significant reduction in explained variance, as indicated by an F-statistic of 4.4037 with a p-value of 0.01233. However, the decrease in adjusted R-squared (from 0.7281 to 0.7274) and the slight change in RSS (from 187071813 to 186411655) suggest that the reduction in variance explained is minimal in practical terms.
Additionally, Model 3 has a slightly lower Bayesian Information Criterion (BIC) of 35245.97 compared to 35252.78 for Model 2. This lower BIC value indicates that Model 3 achieves a better balance between variance explained and model complexity. Given the minimal contribution of “Ospedale” to predictive power and its lack of statistical significance in Model 2, Model 3 is considered the more optimal and parsimonious choice.
anova(mod3, mod2)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Sesso
## Model 2: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2489 187071813
## 2 2487 186411655 2 660158 4.4037 0.01233 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod3, mod2)
## df BIC
## mod3 12 35245.97
## mod2 14 35252.78
Model 4 was derived by removing the variable “Number of Pregnancies” from Model 3, as it was the least significant predictor in the previous model, confirmed by its high p-value and lack of significance. After updating the model, we analyze the regression report to assess whether the significant predictors in Model 3 remain significant in Model 4. The results indicate that all previously significant predictors maintain their significance, while the overall adjusted R-squared slightly decreases from 0.7274 to 0.7262.
mod4 <- update(mod3, ~.-N.gravidanze)
summary(mod4)
##
## Call:
## lm(formula = dati$Peso ~ Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Sesso, data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1118.88 -185.33 -16.16 161.80 2622.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6675.9335 135.7771 -49.168 < 2e-16 ***
## Fumatriciyes -27.5245 27.5783 -0.998 0.3184
## Gestazione 31.4067 3.7882 8.291 < 2e-16 ***
## Lunghezza 10.2276 0.3011 33.969 < 2e-16 ***
## Cranio 10.6379 0.4242 25.075 < 2e-16 ***
## Tipo.partoNat 29.3167 12.1132 2.420 0.0156 *
## SessoM 79.2477 11.2025 7.074 1.95e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2493 degrees of freedom
## Multiple R-squared: 0.7268, Adjusted R-squared: 0.7262
## F-statistic: 1106 on 6 and 2493 DF, p-value: < 2.2e-16
The ANOVA comparison between Model 4 and Model 3 shows a significant reduction in explained variance, with an F-statistic of 3.678 and a p-value of 0.005426, which is below the 5% significance level. This indicates that the removal of the variable “Number of Pregnancies” has led to a measurable reduction in the explained variance, even though the impact on the overall model performance appears minimal.
Despite the reduction in explained variance, Model 4 demonstrates a better Bayesian Information Criterion (BIC) score of 35229.41 compared to 35245.97 for Model 3. The lower BIC suggests that Model 4 is a more parsimonious choice, offering a better balance between model complexity and explanatory power, even after the removal of the variable “Number of Pregnancies.”
anova(mod4, mod3)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ Fumatrici + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso
## Model 2: dati$Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2493 188177555
## 2 2489 187071813 4 1105742 3.678 0.005426 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod4, mod3)
## df BIC
## mod4 8 35229.41
## mod3 12 35245.97
The Model 5 was created by updating Model 4, removing the variable Fumatrici, which was found to be non-significant in the previous model. The updated model retains all previously significant predictors (Gestazione, Lunghezza, Cranio, Tipo.parto, and Sesso). All these predictors remain significant in Model 5 with p-values < 0.05. The Adjusted R-squared of the model remains almost unchanged at 0.7262, indicating that the explanatory power of the model has been maintained despite the removal of Fumatrici.
mod5 <- update(mod4, ~.-Fumatrici)
summary(mod5)
##
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso, data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1118.43 -185.56 -16.07 161.53 2626.18
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6675.8084 135.7769 -49.167 < 2e-16 ***
## Gestazione 31.1917 3.7821 8.247 2.59e-16 ***
## Lunghezza 10.2412 0.3008 34.049 < 2e-16 ***
## Cranio 10.6397 0.4242 25.080 < 2e-16 ***
## Tipo.partoNat 29.1062 12.1113 2.403 0.0163 *
## SessoM 79.0670 11.2010 7.059 2.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2494 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7262
## F-statistic: 1326 on 5 and 2494 DF, p-value: < 2.2e-16
The ANOVA comparison between Model 5 and Model 4 shows that removing Fumatrici does not lead to a significant reduction in the explained variance. The F-statistic is 0.9961 with a p-value of 0.3184, which is greater than the 5% significance threshold. This indicates that the removal of Fumatrici does not significantly affect the predictive ability of the model.
The Bayesian Information Criterion (BIC) comparison reveals that Model 5 has a slightly lower BIC value (35222.59) compared to Model 4 (35229.41). This indicates that Model 5 is more parsimonious and provides a better balance between model complexity and variance explained. Consequently, Model 5 is preferred over Model 4.
anova(mod5, mod4)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso
## Model 2: dati$Peso ~ Fumatrici + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2494 188252743
## 2 2493 188177555 1 75188 0.9961 0.3184
BIC(mod5, mod4)
## df BIC
## mod5 7 35222.59
## mod4 8 35229.41
We updated Model 5 by removing the variable “Tipo Parto”. While “Tipo Parto” was identified as significant in the regression analysis, earlier exploratory analysis suggested no strong correlation between “Tipo Parto” and “Peso”, as evidenced by the violin and box plots and the t-test analysis. These findings motivated the removal of “Tipo Parto” to evaluate the model’s performance without this variable.
mod6 <- update(mod5, ~.-Tipo.parto)
summary(mod6)
##
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.2 -184.3 -17.6 163.3 2627.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.1188 135.5172 -49.080 < 2e-16 ***
## Gestazione 31.2737 3.7856 8.261 2.31e-16 ***
## Lunghezza 10.2054 0.3007 33.939 < 2e-16 ***
## Cranio 10.6704 0.4245 25.139 < 2e-16 ***
## SessoM 79.1049 11.2117 7.056 2.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275 on 2495 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1654 on 4 and 2495 DF, p-value: < 2.2e-16
The ANOVA comparison between Model 6 and Model 5 indicates a statistically significant reduction in the explained variance (p-value = 0.01632), suggesting that the removal of “Tipo Parto” results in a loss of explanatory power. However, the extent of this reduction is relatively minor, as indicated by the small F-statistic value of 5.7755.
The BIC for Model 6 is slightly lower (35220.54) compared to Model 5 (35222.59), indicating that Model 6 achieves a marginally better balance between goodness-of-fit and model complexity. Based on the BIC, Model 6 is the more parsimonious choice despite the statistically significant loss of variance explained.
anova(mod6, mod5)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Model 2: dati$Peso ~ Gestazione + Lunghezza + Cranio + Tipo.parto + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2495 188688687
## 2 2494 188252743 1 435945 5.7755 0.01632 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod6, mod5)
## df BIC
## mod6 6 35220.54
## mod5 7 35222.59
The quadratic term for “Lunghezza” was added to Model 6 to account for the non-linear relationship observed between “Lunghezza” and “Peso”. The summary report shows that all coefficients, including the quadratic term, are statistically significant. This suggests that the inclusion of this non-linear interaction improves the model’s explanatory power. Specifically, the quadratic term for “Lunghezza” significantly enhances the prediction of “Peso”, confirming the initial observation of a curved relationship in the scatter plot matrix.
mod7 <- update(mod6, ~.+I(Lunghezza^2))
summary(mod7)
##
## Call:
## lm(formula = dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## I(Lunghezza^2), data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1156.75 -182.00 -12.52 166.67 1783.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.673587 724.783055 0.216 0.829
## Gestazione 41.174410 3.860512 10.666 < 2e-16 ***
## Lunghezza -19.913124 3.165788 -6.290 3.73e-10 ***
## Cranio 10.795854 0.417182 25.878 < 2e-16 ***
## SessoM 71.369249 11.043854 6.462 1.24e-10 ***
## I(Lunghezza^2) 0.031241 0.003269 9.555 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 270.2 on 2494 degrees of freedom
## Multiple R-squared: 0.7358, Adjusted R-squared: 0.7352
## F-statistic: 1389 on 5 and 2494 DF, p-value: < 2.2e-16
The ANOVA table comparing Model 7 to Model 6 reveals a substantial improvement in explained variance with the addition of the quadratic term for “Lunghezza”. The F-statistic is 91.307 with a highly significant p-value (< 2.2e-16), indicating that the added term contributes meaningfully to the model. This confirms that incorporating the non-linear relationship provides a better fit to the data.
The Bayesian Information Criterion (BIC) decreases significantly from 35220.54 in Model 6 to 35138.48 in Model 7. This reduction in BIC supports the improved balance between model complexity and explanatory power in Model 7. Hence, the quadratic term for “Lunghezza” is a valuable addition to the model, making Model 7 a better and more parsimonious choice.
anova(mod7, mod6)
## Analysis of Variance Table
##
## Model 1: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso + I(Lunghezza^2)
## Model 2: dati$Peso ~ Gestazione + Lunghezza + Cranio + Sesso
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2494 182024674
## 2 2495 188688687 -1 -6664014 91.307 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BIC(mod7, mod6)
## df BIC
## mod7 7 35138.48
## mod6 6 35220.54
Conclusion:
The final model selected is Model 7, which provides the best balance between explained variance and model complexity. While additional attempts were made to include other interactions and non-linear terms, such as Cranio², Lunghezza × Cranio, Gestazione², and Gestazione × Cranio, none of these additions significantly improved the model’s performance. Therefore, Model 7, which includes the quadratic term for “Lunghezza”, is chosen as the optimal multiple regression model.
In the next section, the assumptions of the regression model will be thoroughly verified by analyzing the residuals to ensure the validity of the model.
For a multiple linear regression model to be valid, its residuals must satisfy several key assumptions: 1) normal distribution, 2) mean equal to zero, 3) constant variance (homoscedasticity), 4) independence of residuals, both from each other and from the predictors. In this section, we visually assess these assumptions using diagnostic plots before performing statistical tests for confirmation.
par(mfrow = c(2,2))
plot(mod7)
This plot checks for mean zero and homoscedasticity. The residuals should be centered around zero with no discernible pattern or systematic trend. In our plot, residuals appear roughly centered around zero, but slight variance inconsistencies (heteroscedasticity) may exist, with residuals spreading more for higher fitted values.
The Q-Q plot assesses the normality of residuals by comparing their distribution to a theoretical normal distribution. If the residuals are normal, points should align closely with the diagonal line. The plot indicates that residuals mostly follow the diagonal, with some deviations in the tails, suggesting potential slight non-normality.
This plot checks for homoscedasticity by showing the square root of the standardized residuals against the fitted values. A horizontal line indicates constant variance. The plot reveals a fairly consistent variance but hints at potential non-constant variance for higher fitted values.
This plot identifies high-leverage points and influential observations using Cook’s distance. Observations like “1551” appear to have higher leverage, potentially impacting the model. These points may require further investigation to determine if they unduly influence the model’s results.
The leverage analysis evaluates the influence of individual observations
on the regression model. Observations with leverage values significantly
higher than the threshold indicate that these points have a
disproportionate effect on the model’s estimated coefficients. The
threshold used in this analysis was calculated as 2p/n,
where p is the number of predictors plus one for the
intercept, and n is the total number of observations in the
dataset.
lev <- hatvalues(mod7)
plot(lev)
p = sum(lev)
soglia = 2*p/N
abline(h = soglia, col = 2)
The leverage plot highlights observations with high leverage, which are data points that have a significant influence on the model due to their distance from the mean of the predictors. Observations above the red threshold line are considered to have high leverage. These points may require further investigation to assess their impact on the regression model. In this case, the plot shows that most data points have low leverage, with a few notable exceptions crossing the threshold.
The residuals plot highlights potential outliers based on studentized
residuals exceeding the thresholds of ±4, as marked by the red
horizontal lines. These extreme residuals are further evaluated using
the outlierTest() function, which performs a statistical
test to identify significant outliers.
From the test, the observations 1551, 1306, 155, 1399, and 1694 exhibit highly significant deviations with Bonferroni-adjusted p-values below the standard significance level of 0.05. These observations warrant closer inspection as they might unduly influence the regression model’s performance.
library(car)
## Warning: package 'car' was built under R version 4.4.1
## Loading required package: carData
plot(rstudent(mod7))
abline(h=c(-4,4), col =2)
outlierTest(mod7)
## rstudent unadjusted p-value Bonferroni p
## 1551 7.258271 5.2127e-13 1.3032e-09
## 1306 4.781064 1.8452e-06 4.6130e-03
## 155 4.664817 3.2521e-06 8.1303e-03
## 1399 -4.301603 1.7610e-05 4.4025e-02
## 1694 4.293808 1.8235e-05 4.5588e-02
The Breusch-Pagan test evaluates the assumption of constant variance (homoscedasticity) in the residuals. The test result for Model 7 shows a test statistic of BP = 126.16 with a p-value < 2.2e-16, indicating that the null hypothesis of homoscedasticity is rejected. This suggests that the residuals exhibit heteroscedasticity, and further adjustments to the model or the use of robust standard errors might be necessary.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mod7)
##
## studentized Breusch-Pagan test
##
## data: mod7
## BP = 126.16, df = 5, p-value < 2.2e-16
The Durbin-Watson test checks for autocorrelation in the residuals. For Model 7, the test yields DW = 1.9508 with a p-value = 0.1093. Since the p-value is greater than the standard significance level of 0.05, we fail to reject the null hypothesis, suggesting that there is no significant autocorrelation in the residuals.
dwtest(mod7)
##
## Durbin-Watson test
##
## data: mod7
## DW = 1.9508, p-value = 0.1093
## alternative hypothesis: true autocorrelation is greater than 0
The Shapiro-Wilk test assesses the normality of the residuals. For Model 7, the test provides a test statistic of W = 0.98552 with a p-value = 2.86e-15. The extremely low p-value indicates a rejection of the null hypothesis of normality, implying that the residuals deviate significantly from a normal distribution. Further investigation or transformations might be necessary to address this deviation.
shapiro.test(residuals(mod7))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod7)
## W = 0.98552, p-value = 2.86e-15
Based on the analyses conducted, Model 7 has been selected as the best model for prediction due to its balance between explained variance and model complexity. However, the diagnostic tests on the residuals revealed some issues:
To further improve the model, the following actions could be considered:
Despite these limitations, Model 7 provides a solid foundation for prediction. Adjustments to account for identified issues could enhance its robustness and reliability.
The finalized regression model (Model 7) was employed to predict the weight of a newborn based on specific maternal and neonatal characteristics. For this example, the prediction was made for a male newborn (Sesso = “M”) delivered at the 39th week of gestation, with an estimated length of 500 mm and a cranial circumference of 320 mm. This highlights the practical use of the model in estimating outcomes under specified conditions, yielding a predicted weight of approximately 3142 grams.
new_data <- data.frame(
Gestazione = 39,
Lunghezza = 500,
Cranio = 320,
Sesso = "M" )
weight_prediction <- predict(mod7, newdata = new_data)
weight_prediction
## 1
## 3142.127