getwd()
## [1] "C:/Users/PC/Desktop"
setwd("C:/Users/PC/Desktop")
library(moments)
library(ggplot2)
dati = read.csv("neonati.csv")
attach(dati)
n = nrow(dati)
the dataset is composed of 2500 objects and 10 variables:
Anni.madre: continuous quantitative variable
N. gravidanze: discrete quantitative variable
Fumatrici: dichotomous qualitative variable (coded in dummy, 0=non-smoker, 1=smoker)
Gestazione: continuous quantitative variable
Peso: continuous quantitative variable
Lunghezza: continuous quantitative variable
Cranio: continuous quantitative variable
Tipo.parto: nominal qualitative variable
Ospedale: nominal qualitative variable
Sesso: dichotomous qualitative variable
Anni.madre (Mother’s age): There are two errors in the mother’s age in the dataset, indicated as 0 and 1. We can correct them by replacing them with the median of the age, calculated without considering these values.
mediana = median(dati$Anni.madre[!(dati$Anni.madre%in% c(0,1))], na.rm= T) #in this case with the value 28
dati$Anni.madre[dati$Anni.madre %in% c(0, 1)] = mediana
dati$Fumatrici = factor(Fumatrici, levels = c(0,1), labels = c('N', 'Y')) # 0= N, no smoker; 1= Y, smoker
gini.index = function(x){
ni = table(x)
fi = ni/length(x)
fi2 = fi^2
J= length(table(x))
gini = 1-sum(fi2)
gini.norm = gini/((J-1)/J)
return(gini.norm)
} # gini function
for the numeric variables I calculate the main indices
index_calculation = function(x) {
round(c(mean = mean(x),
median = median(x),
devst = sd(x),
min = min(x),
max = max(x),
Q1 = quantile(x, 0.25),
Q3 = quantile(x, 0.75),
IQR = IQR(x),
kurtosis = kurtosis(x),
skewness = skewness(x),
cv = sd(x)/mean(x)*100,
gini = gini.index(x)
),2)
}
var_quant = c('Anni.madre', 'N.gravidanze', 'Gestazione', 'Peso','Lunghezza', 'Cranio')
quant = sapply(dati[var_quant], index_calculation)
kable(quant)
| Anni.madre | N.gravidanze | Gestazione | Peso | Lunghezza | Cranio | |
|---|---|---|---|---|---|---|
| mean | 28.19 | 0.98 | 38.98 | 3284.08 | 494.69 | 340.03 |
| median | 28.00 | 1.00 | 39.00 | 3300.00 | 500.00 | 340.00 |
| devst | 5.22 | 1.28 | 1.87 | 525.04 | 26.32 | 16.43 |
| min | 13.00 | 0.00 | 25.00 | 830.00 | 310.00 | 235.00 |
| max | 46.00 | 12.00 | 43.00 | 4930.00 | 565.00 | 390.00 |
| Q1.25% | 25.00 | 0.00 | 38.00 | 2990.00 | 480.00 | 330.00 |
| Q3.75% | 32.00 | 1.00 | 40.00 | 3620.00 | 510.00 | 350.00 |
| IQR | 7.00 | 1.00 | 2.00 | 630.00 | 30.00 | 20.00 |
| kurtosis | 2.90 | 13.99 | 11.26 | 5.03 | 9.49 | 5.95 |
| skewness | 0.15 | 2.51 | -2.07 | -0.65 | -1.51 | -0.79 |
| cv | 18.50 | 130.51 | 4.79 | 15.99 | 5.32 | 4.83 |
| gini | 0.97 | 0.73 | 0.85 | 1.00 | 0.94 | 0.97 |
Analysis of quantitative variables:
Anni.madre: The mean is 28 years. 50% of the mothers’ age is composed of an age ranging from 25 (Q1) to 32 (Q3) years. The minimum age is 13 years and the maximum age is 46 years. It has a symmetric distribution (skewness = 0.15).
N. gravidanze: The mean is 1 pregnancy. 50% of the no. of pregnancies is between 0 (Q1) and 1 (Q3). The minimum no. of pregnancies is 0 and the maximum is 12. The distribution is leptokurtic, and is strongly skewed, with a long tail towards higher values. The very high CV suggests that the number of pregnancies has extreme variability. This is consistent with the high kurtosis and strong skewness of the distribution, where most women have few pregnancies.
Gestazione: The mean is 39 weeks of gestation. 50% is composed of 38 (Q1) to 40 (Q3) weeks. Furthermore, the minimum value is 25 weeks and the maximum value is 43. The distribution is leptokurtic, it is strongly skewed towards lower values (-2.07), indicating that most pregnancies have a duration close to the mean, with a long tail towards shorter duration. The relatively low CV indicates that the duration of pregnancies is quite stable and concentrated around the mean.
Peso: The mean is 3,284kg. 50% of newborn weight is between 2,999kg (Q1) and 3,620kg (Q3). The minimum weight is 830g and the maximum weight is 4,930kg. The distribution is leptokurtic, and has a negative skewness (-0.65), indicating a longer tail on the left.
Lunghezza: The mean is 494mm. 50% of newborns have a length that is between 480mm (Q1) and 510mm (Q3). The minimum length is 310mm and the maximum is 565mm. Here too the distribution is concentrated around the mean. The distribution is asymmetric and leptokurtic.
Cranio: The mean of the skull circumference is 340. 50% of the skull circumference of newborns is between 235 (Q1) and 390 (Q3). The distribution shows some concentration around the mean value, with some extreme values. The distribution is asymmetric and leptokurtic.
par(mfrow=c(2,3), mar=c(4,4,2,1))
hist(Anni.madre, col = 'lightblue', main = "Mother's Age Distribution",
xlab = "Mother's Age", ylab = 'Frequency', border= 'black')
hist(N.gravidanze, col = 'lightcoral', main = 'N. Pregnancies Distribution',
xlab = 'N. Pregnancies', ylab = 'Frequency', border= 'black')
hist(Gestazione, col = 'lightgreen', main = 'Gestation Duration',
xlab = 'Weeks of Gestation', ylab = 'Frequency', border= 'black')
hist(Peso, col = 'lightyellow', main = 'Newborn Weight (g)',
xlab = 'Weight (g)', ylab = 'Frequency', border= 'black')
hist(Lunghezza, col = 'lightpink', main = 'Length of newborns',
xlab = 'Length (cm)', ylab = 'Frequency', border= 'black')
hist(Cranio, col = 'lightgray', main = 'Skull diameter',
xlab = 'Skull diameter of newborns (cm)', ylab = 'Frequency', border= 'black')
Everything we described above is displayed in the graphs.
same thing for categorical variables
cat_calculation = function(x){
round(c(mode = table(x),
length = length(x),
gini = gini.index(x)
),2)
}
var_qual = c('Fumatrici', 'Tipo.parto', 'Ospedale', 'Sesso')
qual = sapply(dati[var_qual], cat_calculation)
kable(qual)
|
|
|
|
Fumatrici: The majority of mothers are included in the ‘non-smokers’ category.
Tipo.parto: The majority of mothers who gave birth had a Natural birth.
Ospedale: The majority of mothers chose hospital 2 to give birth, although the difference compared to hospitals 1 and 3 is minimal. The Gini index, equal to 1, indicates maximum heterogeneity.
Sesso: For the variable “Sex”, the majority of mothers had a female child (1256), compared to male children (1244). The Gini index, equal to 1, reflects a very balanced distribution between the two categories, indicating an almost perfect parity between the number of male and female children, but not total homogeneity.
par(mfrow=c(2,2), mar=c(4,4,2,1))
barplot(table(dati$Fumatrici), col = 'lightblue', main = 'Smoking and no smoking mothers',
ylab = 'Frequency', border= 'black')
barplot(table(Tipo.parto), col = 'lightcoral', main = 'Delivery type',
ylab = 'Frequency', border= 'black')
barplot(table(Ospedale), col = 'lightgreen', main = 'Hospital',
ylab = 'Frequency', border= 'black')
barplot(table(Sesso), col = 'lightyellow', main = 'Gender',
ylab = 'Frequency', border= 'black')
Everything we described above is displayed in the graphs.
ggplot(dati, aes(x = Fumatrici, y = Peso)) +
geom_boxplot(fill = 'lightblue') +
labs(title = 'Impact of smoking on newborn weight',
x = 'Smoking during pregnancy',
y = 'Weight (g)') +
theme_minimal()
Smoking during pregnancy appears to have a minimal impact on infant weight, with a slight difference between mothers who smoke and those who do not smoke. Mothers who do not smoke tend to have infants with a slightly higher weight, although the difference is not particularly marked, but it is noticeable. In both groups, outliers are observed, both with weights higher and lower than the average.
ggplot(dati, aes(x = Gestazione, y = Peso, fill = Fumatrici)) +
geom_boxplot() +
scale_fill_discrete(name = "Smoking during pregnancy") +
labs(
title = 'Impact of smoking on newborn weight',
x = 'Weeks of gestation',
y = 'Newborn weight (g)'
) +
theme_minimal() +
theme(
legend.position = 'bottom',
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = 'gray90'),
strip.text = element_text(face = 'bold')
)
The graph shows that smoking during pregnancy may reduce the weight of the newborn, with a non-significant difference. Newborns of mothers who do not smoke have a slightly higher median weight, while those of mothers who smoke tend to have lower weights.
ggplot(dati, aes(x = as.factor(N.gravidanze), y = Peso)) +
geom_boxplot(fill = 'lightpink') +
labs(title = 'Newborn weight in relation to number of pregnancies',
x = 'Number of pregnancies',
y = 'Weight (g)') +
theme_minimal()
The relationship between the number of pregnancies and newborn weight does not show a clear trend. The median newborn weight remains relatively stable until the fifth pregnancy.
ggplot(dati, aes(x = Tipo.parto, y = Peso)) +
geom_boxplot(fill = "lightgreen") +
theme_minimal() +
labs(title = "Newborn weight by type of delivery",
x = "Type of delivery",
y = "Weight (g)")
Comparison of the type of delivery (caesarean and natural) and the weight of the newborn does not reveal significant differences. The medians of the weights are almost identical and the distributions are very similar in the two groups. In both delivery methods, outliers are observed, indicating extreme values of weight, but without a significant difference between the two groups.
ggplot(dati, aes(x = Anni.madre, y = Peso)) +
geom_point() +
geom_smooth(method = 'lm', color = 'blue') +
theme_minimal() +
labs(title = "Relationship between newborn weight and mother's age",
x = "Mother's age",
y = 'Weight (g)')
## `geom_smooth()` using formula = 'y ~ x'
The analysis of the relationship between maternal age and newborn weight does not show a clear correlation. The blue regression line is almost horizontal, indicating the absence of a significant trend. The points are widely dispersed around the line, showing a large variability in weight for all maternal ages.
ggplot(dati, aes(x = Ospedale, y = Peso, fill = Tipo.parto)) +
geom_col(position = 'dodge') +
theme_minimal() +
labs(title = 'Newborn weight by hospital and delivery type',
x = 'Hospital',
y = 'Weight (g)',
fill = 'Delivery type')
The graph shows the weight of newborns by hospital and type of birth (caesarean and natural), highlighting a uniform distribution both between hospitals and between types of birth.
chi_quadro = chisq.test(Ospedale,Tipo.parto)
chi_quadro
##
## Pearson's Chi-squared test
##
## data: Ospedale and Tipo.parto
## X-squared = 1.0972, df = 2, p-value = 0.5778
The chi-square test for the variables Hospital and Type.delivery shows a p-value of 0.5778, indicating that there is no significant statistical relationship between these two variables in the sample.
Mean of the population: - Weight: M=3450 g , F=3300 g. - Length: M= 500 cm, F= 4.90 cm.
mu_peso_M = 3450
maschi = dati %>% filter(Sesso=='M')
t.test(x=maschi$Peso,
mu = mu_peso_M)
##
## One Sample t-test
##
## data: maschi$Peso
## t = -2.9845, df = 1243, p-value = 0.002896
## alternative hypothesis: true mean is not equal to 3450
## 95 percent confidence interval:
## 3380.748 3435.683
## sample estimates:
## mean of x
## 3408.215
mu_peso_F = 3300
femmine = dati %>% filter(Sesso=='F')
t.test(x=femmine$Peso,
mu = mu_peso_F)
##
## One Sample t-test
##
## data: femmine$Peso
## t = -9.3509, df = 1255, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3300
## 95 percent confidence interval:
## 3131.997 3190.267
## sample estimates:
## mean of x
## 3161.132
The null hypothesis is rejected for both genders for weight
mu_lunghezza_M = 500
maschi = dati %>% filter(Sesso=='M')
t.test(x=maschi$Lunghezza,
mu = mu_lunghezza_M)
##
## One Sample t-test
##
## data: maschi$Lunghezza
## t = -0.4883, df = 1243, p-value = 0.6254
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 498.3301 501.0043
## sample estimates:
## mean of x
## 499.6672
mu_lunghezza_F = 490
femmine = dati %>% filter(Sesso=='F')
t.test(x=femmine$Lunghezza,
mu = mu_lunghezza_F)
##
## One Sample t-test
##
## data: femmine$Lunghezza
## t = -0.30334, df = 1255, p-value = 0.7617
## alternative hypothesis: true mean is not equal to 490
## 95 percent confidence interval:
## 488.2401 491.2885
## sample estimates:
## mean of x
## 489.7643
The null hypothesis is not rejected for both sexes for the length
Student t tests for the length variable do not reject the null hypothesis, therefore the mean of the sample under examination is not significantly different from that of the population. For the t tests performed on the weight, the null hypothesis is rejected, therefore the means are significantly different from those taken as reference for the population.
t.test(maschi$Peso, femmine$Peso)
##
## Welch Two Sample t-test
##
## data: maschi$Peso and femmine$Peso
## t = 12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 207.0615 287.1051
## sample estimates:
## mean of x mean of y
## 3408.215 3161.132
The means of the two groups are: - Male mean: 3408.215 - Female mean: 3161.132 The difference between the means is significant, with males tending to have a higher average weight than females in the sample.
t.test(maschi$Lunghezza, femmine$Lunghezza)
##
## Welch Two Sample t-test
##
## data: maschi$Lunghezza and femmine$Lunghezza
## t = 9.582, df = 2459.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.876273 11.929470
## sample estimates:
## mean of x mean of y
## 499.6672 489.7643
The means of the two groups are: - Male mean:499.6672 - Female mean: 489.7643 The difference between the means is significant, with males tending to have a greater mean length than females in the sample.
t.test(maschi$Cranio, femmine$Cranio)
##
## Welch Two Sample t-test
##
## data: maschi$Cranio and femmine$Cranio
## t = 7.4102, df = 2491.4, p-value = 1.718e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.541270 6.089912
## sample estimates:
## mean of x mean of y
## 342.4486 337.6330
The means of the two groups are: - Male mean: 342.4486 - Female mean: 337.6330 The difference between the means is significant, with males tending to have a larger skull diameter than females in the sample.
Creating the Regression Model
par(mfrow=c(2,2), mar=c(4,4,2,1))
boxplot(Peso ~ dati$Fumatrici)
boxplot(Peso ~ Ospedale)
boxplot(Peso ~ Tipo.parto)
boxplot(Peso ~ Sesso)
t.test(Peso ~ Fumatrici, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Fumatrici
## t = 1.034, df = 114.1, p-value = 0.3033
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -45.61354 145.22674
## sample estimates:
## mean in group N mean in group Y
## 3286.153 3236.346
t.test(Peso ~ Sesso, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.106, df = 2490.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.1051 -207.0615
## sample estimates:
## mean in group F mean in group M
## 3161.132 3408.215
t.test(Peso ~ Tipo.parto, data = dati)
##
## Welch Two Sample t-test
##
## data: Peso by Tipo.parto
## t = -0.12968, df = 1493, p-value = 0.8968
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
## -46.27992 40.54037
## sample estimates:
## mean in group Ces mean in group Nat
## 3282.047 3284.916
The variable ‘sesso’ appears to be the most correlated with the weight variable of the categorical variables, with p-value < 2.2e-16.
dati_num=dati %>% select_if(is.numeric)
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- (cor(x, y))
txt <- format(c(r, 1), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = 1.5)
}
pairs(dati_num,lower.panel=panel.cor, upper.panel=panel.smooth)
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
## Warning in par(usr): l'argomento 1 non nomina un parametro grafico
To create the correlation matrix, I only considered numerical variables, since correlation is a measure that applies to quantitative data. The strongest correlations show that weight is closely related to length (r = 0.8) and skull size (r = 0.7), suggesting a proportional growth of the newborn. A moderate correlation (r = 0.59) is also found with gestation weeks, indicating that a longer gestation tends to increase weight, although the relationship is not perfectly linear. On the other hand, weak or zero correlations suggest that factors such as maternal age (r = -0.024) and the number of previous pregnancies (r = 0.0024) do not significantly influence newborn weight, suggesting that these aspects are less decisive in this context.
I performed a multivariate linear regression analysis to study the factors that influence newborn weight.
mod = lm(Peso ~. , data=dati) # I created a complete template (mod) that includes all the variables
summary(mod)
##
## Call:
## lm(formula = Peso ~ ., data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1123.3 -181.2 -14.6 160.7 2612.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6735.1677 141.3977 -47.633 < 2e-16 ***
## Anni.madre 0.7983 1.1463 0.696 0.4862
## N.gravidanze 11.4118 4.6665 2.445 0.0145 *
## FumatriciY -30.1567 27.5396 -1.095 0.2736
## Gestazione 32.5265 3.8179 8.520 < 2e-16 ***
## Lunghezza 10.2951 0.3007 34.237 < 2e-16 ***
## Cranio 10.4725 0.4261 24.580 < 2e-16 ***
## Tipo.partoNat 29.5027 12.0848 2.441 0.0147 *
## Ospedaleosp2 -11.2216 13.4388 -0.835 0.4038
## Ospedaleosp3 28.0984 13.4972 2.082 0.0375 *
## SessoM 77.5473 11.1779 6.938 5.07e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 669.1 on 10 and 2489 DF, p-value: < 2.2e-16
stepwise.mod = MASS::stepAIC(mod,
direction = "both",
k=log(n)) # I used the stepwise.mod function to perform a variable selection using the AIC criterion with BIC correction (k=log(n)).
## Start: AIC=28139.46
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Anni.madre 1 36392 186809099 28132
## - Fumatrici 1 89979 186862686 28133
## - Ospedale 2 686397 187459103 28133
## - Tipo.parto 1 447233 187219939 28138
## - N.gravidanze 1 448762 187221469 28138
## <none> 186772707 28140
## - Sesso 1 3611588 190384294 28180
## - Gestazione 1 5446558 192219264 28204
## - Cranio 1 45338051 232110758 28675
## - Lunghezza 1 87959790 274732497 29096
##
## Step: AIC=28132.12
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Fumatrici 1 90897 186899996 28126
## - Ospedale 2 692738 187501837 28126
## - Tipo.parto 1 448222 187257321 28130
## <none> 186809099 28132
## - N.gravidanze 1 633756 187442855 28133
## + Anni.madre 1 36392 186772707 28140
## - Sesso 1 3618736 190427835 28172
## - Gestazione 1 5412879 192221978 28196
## - Cranio 1 45588236 232397335 28670
## - Lunghezza 1 87950050 274759149 29089
##
## Step: AIC=28125.51
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Ospedale 2 701680 187601677 28119
## - Tipo.parto 1 440684 187340680 28124
## <none> 186899996 28126
## - N.gravidanze 1 610840 187510837 28126
## + Fumatrici 1 90897 186809099 28132
## + Anni.madre 1 37311 186862686 28133
## - Sesso 1 3602797 190502794 28165
## - Gestazione 1 5346781 192246777 28188
## - Cranio 1 45632149 232532146 28664
## - Lunghezza 1 88355030 275255027 29086
##
## Step: AIC=28119.23
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso
##
## Df Sum of Sq RSS AIC
## - Tipo.parto 1 463870 188065546 28118
## <none> 187601677 28119
## - N.gravidanze 1 651066 188252743 28120
## + Ospedale 2 701680 186899996 28126
## + Fumatrici 1 99840 187501837 28126
## + Anni.madre 1 43845 187557831 28127
## - Sesso 1 3649259 191250936 28160
## - Gestazione 1 5444109 193045786 28183
## - Cranio 1 45758101 233359778 28657
## - Lunghezza 1 88054432 275656108 29074
##
## Step: AIC=28117.58
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
##
## Df Sum of Sq RSS AIC
## <none> 188065546 28118
## - N.gravidanze 1 623141 188688687 28118
## + Tipo.parto 1 463870 187601677 28119
## + Ospedale 2 724866 187340680 28124
## + Fumatrici 1 91892 187973654 28124
## + Anni.madre 1 45044 188020502 28125
## - Sesso 1 3655292 191720838 28158
## - Gestazione 1 5464853 193530399 28181
## - Cranio 1 46108583 234174130 28658
## - Lunghezza 1 87632762 275698308 29066
The final selected model includes only 5 significant variables: Number of pregnancies (p = 0.004) Gestation (p < 0.001) Length (p < 0.001) Skull diameter (p < 0.001) Sex (p < 0.001)
summary(stepwise.mod)
##
## Call:
## lm(formula = Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio +
## Sesso, data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.44 -180.81 -15.58 163.64 2639.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.1445 135.7229 -49.226 < 2e-16 ***
## N.gravidanze 12.4750 4.3396 2.875 0.00408 **
## Gestazione 32.3321 3.7980 8.513 < 2e-16 ***
## Lunghezza 10.2486 0.3006 34.090 < 2e-16 ***
## Cranio 10.5402 0.4262 24.728 < 2e-16 ***
## SessoM 77.9927 11.2021 6.962 4.26e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.6 on 2494 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7265
## F-statistic: 1328 on 5 and 2494 DF, p-value: < 2.2e-16
vif(stepwise.mod)
## N.gravidanze Gestazione Lunghezza Cranio Sesso
## 1.023475 1.669189 2.074689 1.624465 1.040054
All these values are >5, which suggests that there is no multicollinearity problem in the model.
Model Quality Analysis
# to see the residuals and outliers
par(mfrow=c(2,2))
plot(stepwise.mod)
Plot of residuals vs. fitted: Residuals are randomly distributed around zero. Q-Q Residuals plot: Points roughly follow the diagonal line, residuals are normally distributed. Data 1551 appears as the only influential value in the model.
# leverage
lev = hatvalues(stepwise.mod)
plot(lev)
p = sum(lev)
soglia = 2*p/n
abline(h=soglia, col=2)
Many points have low leverage, but some have higher values (>0.02).
bptest(stepwise.mod)
##
## studentized Breusch-Pagan test
##
## data: stepwise.mod
## BP = 90.253, df = 5, p-value < 2.2e-16
dwtest(stepwise.mod)
##
## Durbin-Watson test
##
## data: stepwise.mod
## DW = 1.9535, p-value = 0.1224
## alternative hypothesis: true autocorrelation is greater than 0
shapiro.test(residuals(stepwise.mod))
##
## Shapiro-Wilk normality test
##
## data: residuals(stepwise.mod)
## W = 0.97408, p-value < 2.2e-16
Breusch-Pagan Test (p < 0.05): We reject the null hypothesis of homoscedasticity. This suggests that the variance of the errors is not constant Durbin-Watson Test (p > 0.05): We cannot reject the null hypothesis that there is no autocorrelation in the residuals. This suggests that the residuals are independent of each other. Shapiro-Wilk Test (p < 0.05): We reject the null hypothesis that the residuals are normally distributed.
plot(density(residuals(stepwise.mod)))
This plot shows the density of the residuals, which appears normal with: - A symmetric bell shape - Centered around zero - Slightly fatter tails - Bandwidth of 48.38
#outliers
plot(rstudent(stepwise.mod))
abline(h=c(-2,2))
car::outlierTest(stepwise.mod)
## rstudent unadjusted p-value Bonferroni p
## 1551 10.051908 2.4906e-23 6.2265e-20
## 155 5.027798 5.3138e-07 1.3285e-03
## 1306 4.827238 1.4681e-06 3.6702e-03
Most residuals are between +2 and -2 standard deviations, but there are some outliers.
#distanza di cook
cook=cooks.distance(stepwise.mod)
plot(cook,ylim = c(0,1))
Most observations have little influence on the model, except one point that has a major influence.
Estimate the weight of a newborn considering: - SEX: F - GESTATION: 39 weeks - N. PREGNANCIES: 2 (this will be the 3rd) Since the variables length and diameter of the skull were not mentioned, I calculated the average of both.
cranio_mean = mean(Cranio)
lunghezza_mean = mean(Lunghezza)
round(cranio_mean,1)
## [1] 340
round(lunghezza_mean,0)
## [1] 495
data = data_frame(Cranio=340, Lunghezza= 495, Gestazione=39, N.gravidanze=2, Sesso='F')
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
peso_predict = predict(stepwise.mod, data)
round(peso_predict,0)
## 1
## 3261
The model is for a 3,261g newborn
ggplot(dati, aes(x = Cranio, y = Peso)) +
geom_point() +
geom_smooth(method = 'lm', color = 'red') +
theme_minimal() +
labs(title = 'Relationship between weight and skull diameter of newborn',
x = 'Skull diameter',
y = 'Weight (g)')
## `geom_smooth()` using formula = 'y ~ x'
The weight-skull diameter relationship shows a positive correlation between weight and skull diameter, with a fairly small dispersion of points around the trend line. The relationship appears rather linear and the values are mainly concentrated between 320-350 mm of skull diameter, although there are some outliers, especially in the upper part of the graph, referring to heavier newborns.
ggplot(dati, aes(x = Lunghezza, y = Peso)) +
geom_point() +
geom_smooth(method = 'lm', color = 'red') +
theme_minimal() +
labs(title = 'Relationship between weight and length of newborn',
x = 'Length',
y = 'Weight (g)')
## `geom_smooth()` using formula = 'y ~ x'
The weight-length relationship also shows a strong positive correlation, but with a more marked vertical dispersion of points than in the previous graph. Most newborns are concentrated between 450-500 mm in length and the relationship appears slightly curvilinear, with a tendency to flatten for greater lengths.
ggplot(dati, aes(x = Gestazione, y = Peso)) +
geom_point() +
geom_smooth(method = 'lm', color = 'red') +
labs(title = 'Relationship between weeks of gestation and birth weight',
x = 'Weeks of gestation',
y = 'Weight (g)')+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Most births occur between 35 and 40 weeks. Preterm infants, those with a gestation period of less than 35 weeks, tend to have lower birth weights and less variability than full-term infants. For full-term infants, born between 37 and 42 weeks, there is a large variability in weight, ranging from about 2000 g to almost 5000 g. The red trend line shows the average relationship, but the increasing dispersion of the points suggests that factors other than gestation length influence birth weight.
All three variables (skull diameter, length and weeks of gestation) show a clear positive relationship with birth weight. This confirms that they are all relevant indicators for newborn weight.