Statistical Model for Predicting Neonatal Weight

Data Collection and Dataset Structure

getwd()

## [1] "C:/Users/PC/Desktop"

setwd("C:/Users/PC/Desktop")

library(moments)
library(ggplot2)

dati = read.csv("neonati.csv")
attach(dati)
n = nrow(dati)

the dataset is composed of 2500 objects and 10 variables:

Anni.madre: continuous quantitative variable
N. gravidanze: discrete quantitative variable
Fumatrici: dichotomous qualitative variable (coded in dummy, 0=non-smoker, 1=smoker)
Gestazione: continuous quantitative variable
Peso: continuous quantitative variable
Lunghezza: continuous quantitative variable
Cranio: continuous quantitative variable
Tipo.parto: nominal qualitative variable
Ospedale: nominal qualitative variable
Sesso: dichotomous qualitative variable

Anni.madre (Mother’s age): There are two errors in the mother’s age in the dataset, indicated as 0 and 1. We can correct them by replacing them with the median of the age, calculated without considering these values.

mediana = median(dati$Anni.madre[!(dati$Anni.madre%in% c(0,1))], na.rm= T)  #in this case with the value 28
dati$Anni.madre[dati$Anni.madre %in% c(0, 1)] = mediana


dati$Fumatrici = factor(Fumatrici, levels = c(0,1), labels = c('N', 'Y')) # 0= N, no smoker; 1= Y, smoker

gini.index = function(x){
  ni = table(x)
  fi = ni/length(x)
  fi2 = fi^2
  J= length(table(x))
  
  gini = 1-sum(fi2)
  gini.norm = gini/((J-1)/J)
  
  return(gini.norm)
} # gini function

for the numeric variables I calculate the main indices

index_calculation = function(x) {
  round(c(mean = mean(x),
    median = median(x),
    devst = sd(x),
    min = min(x),
    max = max(x),
    Q1 = quantile(x, 0.25),
    Q3 = quantile(x, 0.75),
    IQR = IQR(x),
    kurtosis = kurtosis(x),
    skewness = skewness(x),
    cv = sd(x)/mean(x)*100,
    gini = gini.index(x)
  ),2)
}

var_quant = c('Anni.madre', 'N.gravidanze', 'Gestazione', 'Peso','Lunghezza', 'Cranio')
quant = sapply(dati[var_quant], index_calculation)
kable(quant)

	Anni.madre	N.gravidanze	Gestazione	Peso	Lunghezza	Cranio
mean	28.19	0.98	38.98	3284.08	494.69	340.03
median	28.00	1.00	39.00	3300.00	500.00	340.00
devst	5.22	1.28	1.87	525.04	26.32	16.43
min	13.00	0.00	25.00	830.00	310.00	235.00
max	46.00	12.00	43.00	4930.00	565.00	390.00
Q1.25%	25.00	0.00	38.00	2990.00	480.00	330.00
Q3.75%	32.00	1.00	40.00	3620.00	510.00	350.00
IQR	7.00	1.00	2.00	630.00	30.00	20.00
kurtosis	2.90	13.99	11.26	5.03	9.49	5.95
skewness	0.15	2.51	-2.07	-0.65	-1.51	-0.79
cv	18.50	130.51	4.79	15.99	5.32	4.83
gini	0.97	0.73	0.85	1.00	0.94	0.97

Analysis of quantitative variables:

Anni.madre: The mean is 28 years. 50% of the mothers’ age is composed of an age ranging from 25 (Q1) to 32 (Q3) years. The minimum age is 13 years and the maximum age is 46 years. It has a symmetric distribution (skewness = 0.15).
N. gravidanze: The mean is 1 pregnancy. 50% of the no. of pregnancies is between 0 (Q1) and 1 (Q3). The minimum no. of pregnancies is 0 and the maximum is 12. The distribution is leptokurtic, and is strongly skewed, with a long tail towards higher values. The very high CV suggests that the number of pregnancies has extreme variability. This is consistent with the high kurtosis and strong skewness of the distribution, where most women have few pregnancies.
Gestazione: The mean is 39 weeks of gestation. 50% is composed of 38 (Q1) to 40 (Q3) weeks. Furthermore, the minimum value is 25 weeks and the maximum value is 43. The distribution is leptokurtic, it is strongly skewed towards lower values (-2.07), indicating that most pregnancies have a duration close to the mean, with a long tail towards shorter duration. The relatively low CV indicates that the duration of pregnancies is quite stable and concentrated around the mean.
Peso: The mean is 3,284kg. 50% of newborn weight is between 2,999kg (Q1) and 3,620kg (Q3). The minimum weight is 830g and the maximum weight is 4,930kg. The distribution is leptokurtic, and has a negative skewness (-0.65), indicating a longer tail on the left.
Lunghezza: The mean is 494mm. 50% of newborns have a length that is between 480mm (Q1) and 510mm (Q3). The minimum length is 310mm and the maximum is 565mm. Here too the distribution is concentrated around the mean. The distribution is asymmetric and leptokurtic.
Cranio: The mean of the skull circumference is 340. 50% of the skull circumference of newborns is between 235 (Q1) and 390 (Q3). The distribution shows some concentration around the mean value, with some extreme values. The distribution is asymmetric and leptokurtic.

par(mfrow=c(2,3), mar=c(4,4,2,1))

hist(Anni.madre, col = 'lightblue', main = "Mother's Age Distribution", 
     xlab = "Mother's Age", ylab = 'Frequency', border= 'black')

hist(N.gravidanze, col = 'lightcoral', main = 'N. Pregnancies Distribution',
     xlab = 'N. Pregnancies', ylab = 'Frequency', border= 'black')

hist(Gestazione, col = 'lightgreen', main = 'Gestation Duration',
     xlab = 'Weeks of Gestation', ylab = 'Frequency', border= 'black')

hist(Peso, col = 'lightyellow', main = 'Newborn Weight (g)',
     xlab = 'Weight (g)', ylab = 'Frequency', border= 'black')

hist(Lunghezza, col = 'lightpink', main = 'Length of newborns',
     xlab = 'Length (cm)', ylab = 'Frequency', border= 'black')

hist(Cranio, col = 'lightgray', main = 'Skull diameter',
     xlab = 'Skull diameter of newborns (cm)', ylab = 'Frequency', border= 'black')

Everything we described above is displayed in the graphs.

same thing for categorical variables

cat_calculation = function(x){
  round(c(mode = table(x),
    length = length(x),
    gini = gini.index(x)
  ),2)
}
var_qual = c('Fumatrici', 'Tipo.parto', 'Ospedale', 'Sesso')
qual = sapply(dati[var_qual], cat_calculation)
kable(qual)

	x
mode.N	2396.00
mode.Y	104.00
length	2500.00
gini	0.16

	x
mode.Ces	728.00
mode.Nat	1772.00
length	2500.00
gini	0.83

	x
mode.osp1	816
mode.osp2	849
mode.osp3	835
length	2500
gini	1

	x
mode.F	1256
mode.M	1244
length	2500
gini	1

Fumatrici: The majority of mothers are included in the ‘non-smokers’ category.
Tipo.parto: The majority of mothers who gave birth had a Natural birth.
Ospedale: The majority of mothers chose hospital 2 to give birth, although the difference compared to hospitals 1 and 3 is minimal. The Gini index, equal to 1, indicates maximum heterogeneity.
Sesso: For the variable “Sex”, the majority of mothers had a female child (1256), compared to male children (1244). The Gini index, equal to 1, reflects a very balanced distribution between the two categories, indicating an almost perfect parity between the number of male and female children, but not total homogeneity.

par(mfrow=c(2,2), mar=c(4,4,2,1))

barplot(table(dati$Fumatrici), col = 'lightblue', main = 'Smoking and no smoking mothers', 
     ylab = 'Frequency', border= 'black')

barplot(table(Tipo.parto), col = 'lightcoral', main = 'Delivery type', 
        ylab = 'Frequency', border= 'black')

barplot(table(Ospedale), col = 'lightgreen', main = 'Hospital', 
    ylab = 'Frequency', border= 'black')

barplot(table(Sesso), col = 'lightyellow', main = 'Gender', 
    ylab = 'Frequency', border= 'black')

Everything we described above is displayed in the graphs.

ggplot(dati, aes(x = Fumatrici, y = Peso)) +
  geom_boxplot(fill = 'lightblue') +
  labs(title = 'Impact of smoking on newborn weight',
       x = 'Smoking during pregnancy',
       y = 'Weight (g)') +
  theme_minimal()

Smoking during pregnancy appears to have a minimal impact on infant weight, with a slight difference between mothers who smoke and those who do not smoke. Mothers who do not smoke tend to have infants with a slightly higher weight, although the difference is not particularly marked, but it is noticeable. In both groups, outliers are observed, both with weights higher and lower than the average.

ggplot(dati, aes(x = Gestazione, y = Peso, fill = Fumatrici)) +
  geom_boxplot() +
  scale_fill_discrete(name = "Smoking during pregnancy") +
  labs(
    title = 'Impact of smoking on newborn weight',
    x = 'Weeks of gestation',
    y = 'Newborn weight (g)'
  ) +
  theme_minimal() +
  theme(
    legend.position = 'bottom',
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = 'gray90'),
    strip.text = element_text(face = 'bold')
  )

The graph shows that smoking during pregnancy may reduce the weight of the newborn, with a non-significant difference. Newborns of mothers who do not smoke have a slightly higher median weight, while those of mothers who smoke tend to have lower weights.

ggplot(dati, aes(x = as.factor(N.gravidanze), y = Peso)) +
  geom_boxplot(fill = 'lightpink') +
  labs(title = 'Newborn weight in relation to number of pregnancies',
       x = 'Number of pregnancies',
       y = 'Weight (g)') +
  theme_minimal()

The relationship between the number of pregnancies and newborn weight does not show a clear trend. The median newborn weight remains relatively stable until the fifth pregnancy.

ggplot(dati, aes(x = Tipo.parto, y = Peso)) +
  geom_boxplot(fill = "lightgreen") +
  theme_minimal() +
  labs(title = "Newborn weight by type of delivery",
       x = "Type of delivery",
       y = "Weight (g)")

Comparison of the type of delivery (caesarean and natural) and the weight of the newborn does not reveal significant differences. The medians of the weights are almost identical and the distributions are very similar in the two groups. In both delivery methods, outliers are observed, indicating extreme values of weight, but without a significant difference between the two groups.

ggplot(dati, aes(x = Anni.madre, y = Peso)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'blue') +
  theme_minimal() +
  labs(title = "Relationship between newborn weight and mother's age",
  x = "Mother's age",
  y = 'Weight (g)')

## `geom_smooth()` using formula = 'y ~ x'

The analysis of the relationship between maternal age and newborn weight does not show a clear correlation. The blue regression line is almost horizontal, indicating the absence of a significant trend. The points are widely dispersed around the line, showing a large variability in weight for all maternal ages.

ggplot(dati, aes(x = Ospedale, y = Peso, fill = Tipo.parto)) +
  geom_col(position = 'dodge') +
  theme_minimal() +
  labs(title = 'Newborn weight by hospital and delivery type',
       x = 'Hospital',
       y = 'Weight (g)',
       fill = 'Delivery type')

The graph shows the weight of newborns by hospital and type of birth (caesarean and natural), highlighting a uniform distribution both between hospitals and between types of birth.

Analysis and Modeling

chi_quadro = chisq.test(Ospedale,Tipo.parto)
chi_quadro

## 
##  Pearson's Chi-squared test
## 
## data:  Ospedale and Tipo.parto
## X-squared = 1.0972, df = 2, p-value = 0.5778

The chi-square test for the variables Hospital and Type.delivery shows a p-value of 0.5778, indicating that there is no significant statistical relationship between these two variables in the sample.

The average weight and length of this sample of newborns are significantly equal to those of the population?

Mean of the population: - Weight: M=3450 g , F=3300 g. - Length: M= 500 cm, F= 4.90 cm.

mu_peso_M = 3450
maschi = dati %>% filter(Sesso=='M')

t.test(x=maschi$Peso,
       mu = mu_peso_M)

## 
##  One Sample t-test
## 
## data:  maschi$Peso
## t = -2.9845, df = 1243, p-value = 0.002896
## alternative hypothesis: true mean is not equal to 3450
## 95 percent confidence interval:
##  3380.748 3435.683
## sample estimates:
## mean of x 
##  3408.215

mu_peso_F = 3300
femmine = dati %>% filter(Sesso=='F')

t.test(x=femmine$Peso,
       mu = mu_peso_F)

## 
##  One Sample t-test
## 
## data:  femmine$Peso
## t = -9.3509, df = 1255, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
##  3131.997 3190.267
## sample estimates:
## mean of x 
##  3161.132

The null hypothesis is rejected for both genders for weight

mu_lunghezza_M = 500
maschi = dati %>% filter(Sesso=='M')

t.test(x=maschi$Lunghezza,
       mu = mu_lunghezza_M)

## 
##  One Sample t-test
## 
## data:  maschi$Lunghezza
## t = -0.4883, df = 1243, p-value = 0.6254
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
##  498.3301 501.0043
## sample estimates:
## mean of x 
##  499.6672

mu_lunghezza_F = 490
femmine = dati %>% filter(Sesso=='F')

t.test(x=femmine$Lunghezza,
       mu = mu_lunghezza_F)

## 
##  One Sample t-test
## 
## data:  femmine$Lunghezza
## t = -0.30334, df = 1255, p-value = 0.7617
## alternative hypothesis: true mean is not equal to 490
## 95 percent confidence interval:
##  488.2401 491.2885
## sample estimates:
## mean of x 
##  489.7643

The null hypothesis is not rejected for both sexes for the length

Student t tests for the length variable do not reject the null hypothesis, therefore the mean of the sample under examination is not significantly different from that of the population. For the t tests performed on the weight, the null hypothesis is rejected, therefore the means are significantly different from those taken as reference for the population.

Are anthropometric measurements significantly different between the two genders?

t.test(maschi$Peso, femmine$Peso)

## 
##  Welch Two Sample t-test
## 
## data:  maschi$Peso and femmine$Peso
## t = 12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  207.0615 287.1051
## sample estimates:
## mean of x mean of y 
##  3408.215  3161.132

The means of the two groups are: - Male mean: 3408.215 - Female mean: 3161.132 The difference between the means is significant, with males tending to have a higher average weight than females in the sample.

t.test(maschi$Lunghezza, femmine$Lunghezza)

## 
##  Welch Two Sample t-test
## 
## data:  maschi$Lunghezza and femmine$Lunghezza
## t = 9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.876273 11.929470
## sample estimates:
## mean of x mean of y 
##  499.6672  489.7643

The means of the two groups are: - Male mean:499.6672 - Female mean: 489.7643 The difference between the means is significant, with males tending to have a greater mean length than females in the sample.

t.test(maschi$Cranio, femmine$Cranio)

## 
##  Welch Two Sample t-test
## 
## data:  maschi$Cranio and femmine$Cranio
## t = 7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.541270 6.089912
## sample estimates:
## mean of x mean of y 
##  342.4486  337.6330

The means of the two groups are: - Male mean: 342.4486 - Female mean: 337.6330 The difference between the means is significant, with males tending to have a larger skull diameter than females in the sample.

Creating the Regression Model

par(mfrow=c(2,2), mar=c(4,4,2,1))

boxplot(Peso ~ dati$Fumatrici)
boxplot(Peso ~ Ospedale)
boxplot(Peso ~ Tipo.parto)
boxplot(Peso ~ Sesso)

t.test(Peso ~ Fumatrici, data = dati)

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -45.61354 145.22674
## sample estimates:
## mean in group N mean in group Y 
##        3286.153        3236.346

t.test(Peso ~ Sesso, data = dati)

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
##  -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M 
##        3161.132        3408.215

t.test(Peso ~ Tipo.parto, data = dati)

## 
##  Welch Two Sample t-test
## 
## data:  Peso by Tipo.parto
## t = -0.12968, df = 1493, p-value = 0.8968
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
##  -46.27992  40.54037
## sample estimates:
## mean in group Ces mean in group Nat 
##          3282.047          3284.916

The variable ‘sesso’ appears to be the most correlated with the weight variable of the categorical variables, with p-value < 2.2e-16.

dati_num=dati %>% select_if(is.numeric)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- (cor(x, y))
  txt <- format(c(r, 1), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = 1.5)
}
pairs(dati_num,lower.panel=panel.cor, upper.panel=panel.smooth)

## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico

To create the correlation matrix, I only considered numerical variables, since correlation is a measure that applies to quantitative data. The strongest correlations show that weight is closely related to length (r = 0.8) and skull size (r = 0.7), suggesting a proportional growth of the newborn. A moderate correlation (r = 0.59) is also found with gestation weeks, indicating that a longer gestation tends to increase weight, although the relationship is not perfectly linear. On the other hand, weak or zero correlations suggest that factors such as maternal age (r = -0.024) and the number of previous pregnancies (r = 0.0024) do not significantly influence newborn weight, suggesting that these aspects are less decisive in this context.

I performed a multivariate linear regression analysis to study the factors that influence newborn weight.

mod = lm(Peso ~. , data=dati) # I created a complete template (mod) that includes all the variables
summary(mod)

## 
## Call:
## lm(formula = Peso ~ ., data = dati)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1123.3  -181.2   -14.6   160.7  2612.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6735.1677   141.3977 -47.633  < 2e-16 ***
## Anni.madre        0.7983     1.1463   0.696   0.4862    
## N.gravidanze     11.4118     4.6665   2.445   0.0145 *  
## FumatriciY      -30.1567    27.5396  -1.095   0.2736    
## Gestazione       32.5265     3.8179   8.520  < 2e-16 ***
## Lunghezza        10.2951     0.3007  34.237  < 2e-16 ***
## Cranio           10.4725     0.4261  24.580  < 2e-16 ***
## Tipo.partoNat    29.5027    12.0848   2.441   0.0147 *  
## Ospedaleosp2    -11.2216    13.4388  -0.835   0.4038    
## Ospedaleosp3     28.0984    13.4972   2.082   0.0375 *  
## SessoM           77.5473    11.1779   6.938 5.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared:  0.7289, Adjusted R-squared:  0.7278 
## F-statistic: 669.1 on 10 and 2489 DF,  p-value: < 2.2e-16

stepwise.mod = MASS::stepAIC(mod,
              direction = "both",
              k=log(n)) # I used the stepwise.mod function to perform a variable selection using the AIC criterion with BIC correction (k=log(n)).

## Start:  AIC=28139.46
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza + 
##     Cranio + Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Anni.madre    1     36392 186809099 28132
## - Fumatrici     1     89979 186862686 28133
## - Ospedale      2    686397 187459103 28133
## - Tipo.parto    1    447233 187219939 28138
## - N.gravidanze  1    448762 187221469 28138
## <none>                      186772707 28140
## - Sesso         1   3611588 190384294 28180
## - Gestazione    1   5446558 192219264 28204
## - Cranio        1  45338051 232110758 28675
## - Lunghezza     1  87959790 274732497 29096
## 
## Step:  AIC=28132.12
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio + 
##     Tipo.parto + Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Fumatrici     1     90897 186899996 28126
## - Ospedale      2    692738 187501837 28126
## - Tipo.parto    1    448222 187257321 28130
## <none>                      186809099 28132
## - N.gravidanze  1    633756 187442855 28133
## + Anni.madre    1     36392 186772707 28140
## - Sesso         1   3618736 190427835 28172
## - Gestazione    1   5412879 192221978 28196
## - Cranio        1  45588236 232397335 28670
## - Lunghezza     1  87950050 274759149 29089
## 
## Step:  AIC=28125.51
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Ospedale + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Ospedale      2    701680 187601677 28119
## - Tipo.parto    1    440684 187340680 28124
## <none>                      186899996 28126
## - N.gravidanze  1    610840 187510837 28126
## + Fumatrici     1     90897 186809099 28132
## + Anni.madre    1     37311 186862686 28133
## - Sesso         1   3602797 190502794 28165
## - Gestazione    1   5346781 192246777 28188
## - Cranio        1  45632149 232532146 28664
## - Lunghezza     1  88355030 275255027 29086
## 
## Step:  AIC=28119.23
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto + 
##     Sesso
## 
##                Df Sum of Sq       RSS   AIC
## - Tipo.parto    1    463870 188065546 28118
## <none>                      187601677 28119
## - N.gravidanze  1    651066 188252743 28120
## + Ospedale      2    701680 186899996 28126
## + Fumatrici     1     99840 187501837 28126
## + Anni.madre    1     43845 187557831 28127
## - Sesso         1   3649259 191250936 28160
## - Gestazione    1   5444109 193045786 28183
## - Cranio        1  45758101 233359778 28657
## - Lunghezza     1  88054432 275656108 29074
## 
## Step:  AIC=28117.58
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
## 
##                Df Sum of Sq       RSS   AIC
## <none>                      188065546 28118
## - N.gravidanze  1    623141 188688687 28118
## + Tipo.parto    1    463870 187601677 28119
## + Ospedale      2    724866 187340680 28124
## + Fumatrici     1     91892 187973654 28124
## + Anni.madre    1     45044 188020502 28125
## - Sesso         1   3655292 191720838 28158
## - Gestazione    1   5464853 193530399 28181
## - Cranio        1  46108583 234174130 28658
## - Lunghezza     1  87632762 275698308 29066

The final selected model includes only 5 significant variables: Number of pregnancies (p = 0.004) Gestation (p < 0.001) Length (p < 0.001) Skull diameter (p < 0.001) Sex (p < 0.001)

summary(stepwise.mod)

## 
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + 
##     Sesso, data = dati)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1149.44  -180.81   -15.58   163.64  2639.72 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6681.1445   135.7229 -49.226  < 2e-16 ***
## N.gravidanze    12.4750     4.3396   2.875  0.00408 ** 
## Gestazione      32.3321     3.7980   8.513  < 2e-16 ***
## Lunghezza       10.2486     0.3006  34.090  < 2e-16 ***
## Cranio          10.5402     0.4262  24.728  < 2e-16 ***
## SessoM          77.9927    11.2021   6.962 4.26e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.6 on 2494 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7265 
## F-statistic:  1328 on 5 and 2494 DF,  p-value: < 2.2e-16

vif(stepwise.mod)

## N.gravidanze   Gestazione    Lunghezza       Cranio        Sesso 
##     1.023475     1.669189     2.074689     1.624465     1.040054

All these values are >5, which suggests that there is no multicollinearity problem in the model.

Model Quality Analysis

# to see the residuals and outliers
par(mfrow=c(2,2))
plot(stepwise.mod)

Plot of residuals vs. fitted: Residuals are randomly distributed around zero. Q-Q Residuals plot: Points roughly follow the diagonal line, residuals are normally distributed. Data 1551 appears as the only influential value in the model.

# leverage
lev = hatvalues(stepwise.mod)
plot(lev)
p = sum(lev)
soglia = 2*p/n
abline(h=soglia, col=2)

Many points have low leverage, but some have higher values (>0.02).

bptest(stepwise.mod)

## 
##  studentized Breusch-Pagan test
## 
## data:  stepwise.mod
## BP = 90.253, df = 5, p-value < 2.2e-16

dwtest(stepwise.mod)

## 
##  Durbin-Watson test
## 
## data:  stepwise.mod
## DW = 1.9535, p-value = 0.1224
## alternative hypothesis: true autocorrelation is greater than 0

shapiro.test(residuals(stepwise.mod))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(stepwise.mod)
## W = 0.97408, p-value < 2.2e-16

Breusch-Pagan Test (p < 0.05): We reject the null hypothesis of homoscedasticity. This suggests that the variance of the errors is not constant Durbin-Watson Test (p > 0.05): We cannot reject the null hypothesis that there is no autocorrelation in the residuals. This suggests that the residuals are independent of each other. Shapiro-Wilk Test (p < 0.05): We reject the null hypothesis that the residuals are normally distributed.

plot(density(residuals(stepwise.mod)))

This plot shows the density of the residuals, which appears normal with: - A symmetric bell shape - Centered around zero - Slightly fatter tails - Bandwidth of 48.38

#outliers
plot(rstudent(stepwise.mod))
abline(h=c(-2,2))

car::outlierTest(stepwise.mod)

##       rstudent unadjusted p-value Bonferroni p
## 1551 10.051908         2.4906e-23   6.2265e-20
## 155   5.027798         5.3138e-07   1.3285e-03
## 1306  4.827238         1.4681e-06   3.6702e-03

Most residuals are between +2 and -2 standard deviations, but there are some outliers.

#distanza di cook
cook=cooks.distance(stepwise.mod)
plot(cook,ylim = c(0,1))

Most observations have little influence on the model, except one point that has a major influence.

Predictions and Results

Estimate the weight of a newborn considering: - SEX: F - GESTATION: 39 weeks - N. PREGNANCIES: 2 (this will be the 3rd) Since the variables length and diameter of the skull were not mentioned, I calculated the average of both.

cranio_mean = mean(Cranio)
lunghezza_mean = mean(Lunghezza)
round(cranio_mean,1)

## [1] 340

round(lunghezza_mean,0)

## [1] 495

data = data_frame(Cranio=340, Lunghezza= 495, Gestazione=39, N.gravidanze=2, Sesso='F')

## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

peso_predict = predict(stepwise.mod, data)
round(peso_predict,0)

##    1 
## 3261

The model is for a 3,261g newborn

Visualization

ggplot(dati, aes(x = Cranio, y = Peso)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red') +
  theme_minimal() +
  labs(title = 'Relationship between weight and skull diameter of newborn',
       x = 'Skull diameter',
       y = 'Weight (g)')

## `geom_smooth()` using formula = 'y ~ x'

The weight-skull diameter relationship shows a positive correlation between weight and skull diameter, with a fairly small dispersion of points around the trend line. The relationship appears rather linear and the values are mainly concentrated between 320-350 mm of skull diameter, although there are some outliers, especially in the upper part of the graph, referring to heavier newborns.

ggplot(dati, aes(x = Lunghezza, y = Peso)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red') +
  theme_minimal() +
  labs(title = 'Relationship between weight and length of newborn',
       x = 'Length',
       y = 'Weight (g)')

## `geom_smooth()` using formula = 'y ~ x'

The weight-length relationship also shows a strong positive correlation, but with a more marked vertical dispersion of points than in the previous graph. Most newborns are concentrated between 450-500 mm in length and the relationship appears slightly curvilinear, with a tendency to flatten for greater lengths.

 ggplot(dati, aes(x = Gestazione, y = Peso)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'red') +
  labs(title = 'Relationship between weeks of gestation and birth weight',
        x = 'Weeks of gestation',
        y = 'Weight (g)')+
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Most births occur between 35 and 40 weeks. Preterm infants, those with a gestation period of less than 35 weeks, tend to have lower birth weights and less variability than full-term infants. For full-term infants, born between 37 and 42 weeks, there is a large variability in weight, ranging from about 2000 g to almost 5000 g. The red trend line shows the average relationship, but the increasing dispersion of the points suggests that factors other than gestation length influence birth weight.

All three variables (skull diameter, length and weeks of gestation) show a clear positive relationship with birth weight. This confirms that they are all relevant indicators for newborn weight.

Statistical Model for Predicting Neonatal Weight

Sara Greco