The dataset presents 2500 observations, for 10 variables: * Anni.madre: continuous quantitative variable ; * N.gravidanze: discrete quantitative variable ; * Fumatrice: Factor/dummy variable: “0” means non-smoker while “1” means smoker ; * Gestazione: continuous quantitative variable ; * Peso: continuous quantitative variable ; * Lunghezza: continuous quantitative variable ; * Cranio: continuous quantitative variable ; * Tipo.parto: qualitative variable on nominal scale with 2 modalities: “Ces” and “Nat” ; * Ospedale: qualitative variable on nominal scale with 3 modalities: “osp1”. “osp2” and “osp3” ; * Sesso: qualitative variable on nominal scale with 2 modalities: “F” and “M”.
In this study, “Peso” is the response variable, while the others are
predictor variables.
In particular “Lunghezza” and “Cranio” are
anthropometric data, used as control variables.
data <- read.csv("neonati.csv", sep=",", stringsAsFactors = T)
n<-dim(data)[1]
attach(data)
matrix_data <- matrix(nrow = 6, ncol = 6, byrow=F)
for (j in 1:6) {
matrix_data[,j] <- c(summary(Anni.madre)[j], summary(N.gravidanze)[j], summary(Gestazione)[j], summary(Peso)[j], summary(Lunghezza)[j], summary(Cranio)[j])
}
library(DT)
datatable(round(matrix_data, digits =2),
colnames = c("Min", "1st. Quart", "Median", "Mean", "3rd Quart", "Max"),
rownames = c("Anni.madre", "N.gravidanze", "Gestazione", "Peso", "Lunghezza", "Cranio")
)
The first variable, mother’s age, presents a minimum at 0 years old. To deeply investigate, a plot is done, in order to highlight eventual outliers/anomalies on low mother’s age value.
library(ggplot2)
ggplot(data,aes(x=1:n,y=Anni.madre,label=ifelse(Anni.madre<(quantile(Anni.madre, 0.25) - 1.5 * IQR(Anni.madre)), Anni.madre ,"")))+
geom_point(size=1,color = ifelse(Anni.madre < (quantile(Anni.madre, 0.25) - 1.5 * IQR(Anni.madre)), "red", "black"))+
geom_text(aes(colour="red"), vjust = 1.5, show.legend=F)+
labs(title="Mother's age with highlight on abnormal values", x="", y="Mother's age")+
theme_light()
It appears that among the 5 values identified as outliers, 2 are errors with ages of 0 and 1 year old.
These statistic units are removed from that database, as they can’t correspond to true values.
new_data <- data[ !(Anni.madre %in% c(0, 1)), ]
n<-dim(new_data)[1]
detach(data)
attach(new_data)
Going back to the recapitulative table of quantitative variables, there are no abnormal values in the min and max. To fully describe them, some plots are performed.
fi <- table(N.gravidanze)/n*100
barplot(fi,
main ="Number of previous pregnancies" ,
xlab ="Number of pregnancies" ,
ylab ="Relative frequencies, in %",
ylim = c(0,50),
cex.axis = 0.7,
cex.names= 0.7,
col= "mediumaquamarine")
For more than 40% of the mothers, it is a first pregnancy. The mother with major previous pregnancies had 12. The mean value is 0.98 pregnancy, very close value to 1. So, on average it is the second pregnancy.
fi <- table(Gestazione)/n*100
barplot(fi,
main ="Number of weeks of gestation" ,
xlab ="Number of weeks" ,
ylab ="Relative frequencies, in %",
ylim = c(0,40),
cex.axis = 0.7,
cex.names= 0.7,
col= "mediumaquamarine")
For nearly 30% of the births, it happens at the 40th week, which corresponds to the theoretical due date of the gestation. According to the summary table, the mean and median values are equal to 39 weeks. The minimum is 25 weeks, the maximum is 43 weeks. There is no abnormal value here.
For these 3 continuous quantitative variables, we plot here their densities. As many variables found in nature, we expect Gaussian distributions.
par(mfrow=c(1,3))
plot(density(Peso),
main = "Density of Weight",
xlab = "Weight, g",
ylab = "Density",
xlim = c(min(Peso), max(Peso)))
abline(v=median(Peso), col=2)
text(x=median(Peso)+600, y=0.0008, labels = median(Peso), col=2, cex = 1.3)
plot(density(Lunghezza),
main = "Density of Length",
xlab = "Length, cm",
ylab = "Density",
xlim = c(min(Lunghezza), max(Lunghezza)))
abline(v=median(Lunghezza), col=2)
text(x=median(Lunghezza)+30, y=0.02, labels = median(Lunghezza), col=2, cex = 1.3)
plot(density(Cranio),
main = "Density of Cranium diameter",
xlab = "Cranium diameter, mm",
ylab = "Density",
xlim = c(min(Cranio), max(Cranio)))
abline(v=median(Cranio), col=2)
text(x=median(Cranio)+25, y=0.026, labels = median(Cranio), col=2, cex = 1.3)
Shape factors:
library(moments)
matrix_shape <- round(matrix(c
(skewness(Peso), skewness(Lunghezza), skewness(Cranio),
kurtosis(Peso)-3, kurtosis(Lunghezza)-3, kurtosis(Cranio)-3),
nrow = 3, ncol = 2, byrow=F), 2)
datatable(matrix_shape,
colnames = c("Skewness", "Kurtosis"),
rownames = c("Weight", "Length", "Cranium diameter"))
Compared with Gaussian distributions, these 3 ones have negative skewnesses and are leptokurtic.
fi_S <- round(table(Fumatrici)/n, 2)*100
print(fi_S)
## Fumatrici
## 0 1
## 96 4
In this survey, 96% of the mothers are non-smoking, while 4% smoke.
fi_D <- round(table(Tipo.parto)/n, 2)*100
print(fi_D)
## Tipo.parto
## Ces Nat
## 29 71
71% of the mothers had a natural delivery, while 29% a C-section.
fi_H <- round(table(Ospedale)/n, 2)*100
print(fi_H)
## Ospedale
## osp1 osp2 osp3
## 33 34 33
There is absolute equality: the survey considered the same number of mothers in each of the 3 hospitals.
fi_Sex <- round(table(Sesso)/n, 2)*100
print(fi_Sex)
## Sesso
## F M
## 50 50
There is equality between girls and boys.
We want to know if the 2 qualitative variables “Ospedale” and “Tipo.parto” are independent. A Chi-squared test is performed.
table_Osp_Par <- table(Ospedale, Tipo.parto)
test_Osp_Par <- chisq.test(table_Osp_Par)
print(test_Osp_Par)
##
## Pearson's Chi-squared test
##
## data: table_Osp_Par
## X-squared = 1.083, df = 2, p-value = 0.5819
p-value is equal to 0.58, much higher than 5%, so we can’t refuse H0 hypothesis: there is independency of these 2 variables, so there is no hospital with major numbers of C-sections.
According to the Stanford Medicine Children’s Health website (https://www.stanfordchildrens.org/en/topic/default?id=newborn-measurements-90-P02673), the population mean values are: * Mean weight: 3.200kg * Mean length: 50 cm
mean(Peso);mean(Lunghezza)
## [1] 3284.184
## [1] 494.6958
To test if weight and length values of our sample are significantly different from the population ones, we perform a t.test.
t.test(Peso, mu=3200, conf.level=0.95, alternative = "two.sided")
##
## One Sample t-test
##
## data: Peso
## t = 8.0108, df = 2497, p-value = 1.731e-15
## alternative hypothesis: true mean is not equal to 3200
## 95 percent confidence interval:
## 3263.577 3304.791
## sample estimates:
## mean of x
## 3284.184
Even if population mean (3200 g) and sample mean (3284g) are close values, the t-test suggests that these data are significantly different (p-value very close to 0, the H0 hypothesis of equality is rejected).
t.test(Lunghezza, mu=500, conf.level=0.95, alternative = "two.sided")
##
## One Sample t-test
##
## data: Lunghezza
## t = -10.069, df = 2497, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 500
## 95 percent confidence interval:
## 493.6628 495.7287
## sample estimates:
## mean of x
## 494.6958
In the same way, the t-test here establishs a relevant difference between population mean (500 cm) and sample mean (495 cm), with a significativity level of 5%.
library(RColorBrewer)
ggplot(new_data)+
geom_boxplot(aes(x = Sesso, y = Cranio, fill = Sesso))+
labs(title = "Cranium diameter for M vs F",
x = " ",
y = "Cranium diameter, mm")+
theme_light()+
scale_fill_brewer(palette="Pastel1")
A t-test to compare group mean values is performed.
t.test(data = new_data, Cranio~Sesso)
##
## Welch Two Sample t-test
##
## data: Cranio by Sesso
## t = -7.4366, df = 2489.4, p-value = 1.414e-13
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -6.110504 -3.560417
## sample estimates:
## mean in group F mean in group M
## 337.6231 342.4586
The p-value is very low, so H0 hypothesis is rejected: there is relevant difference of cranium data between male and female.
ggplot(new_data)+
geom_boxplot(aes(x = Sesso, y = Lunghezza, fill = Sesso))+
labs(title = "Length for M vs F",
x = " ",
y = "Length, mm")+
theme_light()+
scale_fill_brewer(palette="Pastel1")
A t-test to compare group mean values is performed.
t.test(data = new_data, Lunghezza~Sesso)
##
## Welch Two Sample t-test
##
## data: Lunghezza by Sesso
## t = -9.5823, df = 2457.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -11.939001 -7.882672
## sample estimates:
## mean in group F mean in group M
## 489.7641 499.6750
As a conclusion, it appears that both cranium diameter and length are significantly different in respect of the newborn sex.
As preamble, we build a correlation matrix, together with the scatterplots of variables per pairs.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(new_data, upper.panel = panel.smooth, lower.panel = panel.cor)
First, we observe the correlations between the response variable weight and the other quantitative variables:
cor(Peso, Anni.madre)
## [1] -0.02378138
As the correlation coefficient is very close to 0, we can say that, based on these sample data, there is no correlation between weight and mother age. Nevertheless, as age is a control variable, and is usual to be considered as very important about pregnancy, at first, it will be kept in the regression model.
cor(Peso, N.gravidanze)
## [1] 0.002277118
With this very low value, it can be said that newborn’s weight and number of previous pregnancies are not correlated variables.
cor(Peso, Gestazione)
## [1] 0.5919592
With a correlation coefficient of nearly 0.6, it seems clear that these 2 variables are correlated: the newborn weight is higher if it stays longer in his/her mother’s womb.
A plot is done to visualize this interaction.
ggplot(new_data)+
geom_point(aes(x=Gestazione, y=Peso), position = "jitter", size = 0.5)+
geom_smooth(aes(x=Gestazione, y=Peso))+
labs(title= "Weight = f(Weeks of Gestation)",
x = "Weeks of gestation",
y = "Weight, g")+
theme_light()
We observe a slighlty non-linear trend here.
cor(Peso, Lunghezza)
## [1] 0.7960415
As logically expected, the weight and length variables are highly correlated. Let’s plot this relation to investigate if it is linear or not.
ggplot(new_data)+
geom_point(aes(x=Lunghezza, y=Peso), size = 0.5, position = "jitter")+
geom_smooth(aes(x=Lunghezza, y=Peso))+
labs(title= "Weight = f(Length)",
x = "Length, mm",
y = "Weight, g")+
theme_light()
From around Length = 370 mm, the relation between weight and length seems to be linear. A point draws our attention:
which(Lunghezza == 315) ; Peso[1549]
## [1] 1549
## [1] 4370
It would corresponds to a newborn with atypical features: small length of 315 mm and important weight of 4.37 kg.
So far we keep this data, a further analysis with Cook distance will determine if it corresponds to an abnormal value.
cor(Peso, Cranio)
## [1] 0.7048438
As for length, there is a strong correlation between weight and cranium diameter.
ggplot(new_data)+
geom_point(aes(x=Cranio, y=Peso), size = 0.5)+
geom_smooth(aes(x=Cranio, y=Peso))+
labs(title= "Weight = f(Cranium diameter)",
x = "Cranium diameter, mm",
y = "Weight, g")+
theme_light()
The relation doesn’t seem to be perfectly linear, next, some quadratic effect can be investigated.
Then, we study correlations between the response variable “Peso” and the qualitative variables.
ggplot(new_data)+
geom_boxplot(aes(x = factor(Fumatrici), y = Peso, fill = factor(Fumatrici)))+
labs(title = "Weight = f(Smoking)",
x = " ",
y = "Weight, g")+
theme_light()+
scale_fill_brewer(palette="Set2")
A t-test is done in order to test the correlation between these 2 variables.
t.test(Peso~Fumatrici)
##
## Welch Two Sample t-test
##
## data: Peso by Fumatrici
## t = 1.0362, df = 114.12, p-value = 0.3023
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -45.5076 145.3399
## sample estimates:
## mean in group 0 mean in group 1
## 3286.262 3236.346
p-value equals to 30%: it is not possible to reject H0, so we don’t conclude about a relevant difference of weight mean in respect of the smoking mother behaviour.
ggplot(new_data)+
geom_boxplot(aes(x = Tipo.parto, y = Peso, fill = Tipo.parto))+
labs(title = "Weight = f(Delivery type)",
x = " ",
y = "Weight, g")+
theme_light()+
scale_fill_brewer(palette="Set3")
A t-test is done in order to test the correlation between these 2 variables.
t.test(Peso~Tipo.parto)
##
## Welch Two Sample t-test
##
## data: Peso by Tipo.parto
## t = -0.13626, df = 1494.4, p-value = 0.8916
## alternative hypothesis: true difference in means between group Ces and group Nat is not equal to 0
## 95 percent confidence interval:
## -46.44246 40.40931
## sample estimates:
## mean in group Ces mean in group Nat
## 3282.047 3285.063
Weight means of the 2 groups divided by delivery type are very close: in fact as visible on the boxplot it doesn’t seem to be any difference, and this result is confirmed by the t-test which, with p-value = 89%, absolutly accept the hypothesis of means equality.
For this reason, type of delivery will not be considered in the regression model.
No correlation is expected between hospital and weight, but a t-test is done to confirm this hypothesis.
pairwise.t.test(Peso, Ospedale, paired = F, pool.sd = T, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: Peso and Ospedale
##
## osp1 osp2
## osp2 1.00 -
## osp3 0.33 0.32
##
## P value adjustment method: bonferroni
p-values are >30%, we accept the hypothesis of means equality: there is no correlation between weight value and the place of birth.
This qualitative variable will not be taken in account for the model, as well as type delivery.
ggplot(new_data)+
geom_boxplot(aes(x = Sesso, y = Peso, fill = Sesso))+
labs(title = "Weight = f(Newborn sex)",
x = " ",
y = "Weight, g")+
theme_light()+
scale_fill_brewer(palette="Pastel1")
A difference is visible between female weight and male weight. We test this hypothesis with a t-test.
t.test(Peso~Sesso)
##
## Welch Two Sample t-test
##
## data: Peso by Sesso
## t = -12.115, df = 2488.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group F and group M is not equal to 0
## 95 percent confidence interval:
## -287.4841 -207.3844
## sample estimates:
## mean in group F mean in group M
## 3161.061 3408.496
As p-value is close to 0, we conclude about the relevant difference of weight means between male and female.
A first model is created, with all variables except delivery type and hospital.
mod1 <- lm(Peso~Anni.madre+N.gravidanze+Fumatrici+Gestazione+Lunghezza+Cranio+Sesso, data = new_data)
summary(mod1)
##
## Call:
## lm(formula = Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione +
## Lunghezza + Cranio + Sesso, data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1160.6 -181.3 -15.7 163.6 2630.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6712.2405 141.3339 -47.492 < 2e-16 ***
## Anni.madre 0.8803 1.1491 0.766 0.444
## N.gravidanze 11.3789 4.6767 2.433 0.015 *
## Fumatrici -30.3958 27.6080 -1.101 0.271
## Gestazione 32.9472 3.8288 8.605 < 2e-16 ***
## Lunghezza 10.2316 0.3011 33.979 < 2e-16 ***
## Cranio 10.5198 0.4271 24.633 < 2e-16 ***
## SessoM 78.0787 11.2132 6.963 4.24e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.7 on 2490 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7264
## F-statistic: 948.3 on 7 and 2490 DF, p-value: < 2.2e-16
This model explains nearly 73% of the variability (adjusted R-squared).
As seen before, mother age and smoking habits are not correlated to the weight (p-values of 44% and 27%).
About number of previous pregnancies: p-value of around 1.5%, so can be considered as significant. Beta coefficient equals to 11.38: it means that, all other things being equal, for each previous pregnancy of the mother, the newborn weight increases of 11 g.
To visualize this interaction, hereafter the corresponding plot.
library(dplyr)
by_Preg <-new_data %>%
group_by(N.gravidanze)%>%
summarise(media = mean(Peso))
ggplot(by_Preg, aes(x=N.gravidanze, y=media))+
geom_col(fill="mediumaquamarine", col="black")+
labs(title="Newborn weight vs number of previous pregnancies",
x="Number of previous pregnancies",
y="Weight, in g")+
theme_light()
The interaction is not obvious: it seems that it is mainly influenced by the data obtained for mothers with 7 previous pregnancies.
table(N.gravidanze)
## N.gravidanze
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 1095 817 340 150 48 21 11 1 8 2 3 1 1
With 7 previous pregnancies there is an only one mother. So we decide to remove this variable in next model.
About the others variables:
Number of weeks of gestation: with a p-value close to 0, the result is very significant. The beta coefficient is about 33, it means that for each week more of gestation, the newborn gains 33g.
Anthropomorphic data: both cranium diameter and length present a p-value close to 0, so a very relevant piece of information. Both have beta coefficients of around 10: it means that for each mm more of cranium diameter or of length corresponds an increase of 10 g for the newborn.
Newborn sex: with a very low p-value, sex is a relevant variable here. The beta coefficient of nearly 78 indicates that meanly, the male newborns weight 78 g more than females.
In the light of this information, a new model is created without the number of previous pregnancies, the mother age, and the smoking mother habits.
mod2 <- update(mod1, ~. -N.gravidanze-Anni.madre-Fumatrici, data = new_data)
summary(mod2)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso,
## data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1138.1 -184.4 -17.4 163.3 2626.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6651.6732 135.5952 -49.055 < 2e-16 ***
## Gestazione 31.3262 3.7884 8.269 < 2e-16 ***
## Lunghezza 10.2024 0.3009 33.909 < 2e-16 ***
## Cranio 10.6706 0.4247 25.126 < 2e-16 ***
## SessoM 79.1027 11.2205 7.050 2.31e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.1 on 2493 degrees of freedom
## Multiple R-squared: 0.7261, Adjusted R-squared: 0.7257
## F-statistic: 1652 on 4 and 2493 DF, p-value: < 2.2e-16
The beta coefficients didn’t change too much, and the adjusted R.squared is unchanged as well.
Now we consider quadratic effects, in particular for variable where the scatterplot shows non linear trend as number of weeks of gestation.
mod3 <- update(mod2, ~. +I(Gestazione^2))
summary(mod3)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## I(Gestazione^2), data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1132.76 -181.60 -15.82 162.79 2649.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4640.6017 899.9572 -5.156 2.71e-07 ***
## Gestazione -80.9464 49.8136 -1.625 0.1043
## Lunghezza 10.3056 0.3041 33.892 < 2e-16 ***
## Cranio 10.7671 0.4265 25.247 < 2e-16 ***
## SessoM 76.9130 11.2530 6.835 1.03e-11 ***
## I(Gestazione^2) 1.4988 0.6631 2.260 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.9 on 2492 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7261
## F-statistic: 1325 on 5 and 2492 DF, p-value: < 2.2e-16
We observe that the effect of gestation at 1st grade becomes insignificant (p-value is up to 10%), while the quadratic effect of the variable has p-value below 5%. Considering the adjusted R-squared which is unchanged, to keep the model as simple as possible mod2 is preferred to mod3.
The cranium diameter variable shows a non linear trend vs weight too. We create mod4 with Cranio variable squared, from mod2.
mod4 <- update(mod2, ~. +I(Cranio^2))
summary(mod4)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## I(Cranio^2), data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1127.14 -183.87 -14.67 163.88 2609.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.00064 1153.55946 0.064 0.949
## Gestazione 37.78342 3.92069 9.637 < 2e-16 ***
## Lunghezza 10.44188 0.30165 34.616 < 2e-16 ***
## Cranio -31.40498 7.17967 -4.374 1.27e-05 ***
## SessoM 74.28142 11.17615 6.646 3.68e-11 ***
## I(Cranio^2) 0.06224 0.01060 5.871 4.92e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.3 on 2492 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7293
## F-statistic: 1346 on 5 and 2492 DF, p-value: < 2.2e-16
Both “Cranio” and “Cranio”^2 are significative, as there p-values are very low, but beta coefficient of “Cranio”^2 is close to 0, so the interaction with weight is not important.
Adjusted R.squared stays equal, so in order to keep the model as simple as possible, “Cranio”^2 is not kept in final model.
So far, mod2 is the best model.
Interactions between variables are studied. We can consider it between length and weeks of gestation.
mod5 <- update(mod2, ~. +Lunghezza*Gestazione)
summary(mod5)
##
## Call:
## lm(formula = Peso ~ Gestazione + Lunghezza + Cranio + Sesso +
## Gestazione:Lunghezza, data = new_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1121.91 -181.73 -13.56 166.26 2639.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.031e+03 9.220e+02 -2.202 0.027728 *
## Gestazione -9.311e+01 2.485e+01 -3.747 0.000183 ***
## Lunghezza 2.599e-02 2.031e+00 0.013 0.989789
## Cranio 1.089e+01 4.249e-01 25.639 < 2e-16 ***
## SessoM 7.350e+01 1.122e+01 6.551 6.91e-11 ***
## Gestazione:Lunghezza 2.688e-01 5.305e-02 5.067 4.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 2492 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7284
## F-statistic: 1340 on 5 and 2492 DF, p-value: < 2.2e-16
Length as single effect is not significative anymore. The improvement of adjusted R-squared is minimal.
Akaike and Bayes information criteria are calculated for all the created models.
AIC(mod1, mod2, mod3, mod4, mod5)
## df AIC
## mod1 9 35155.07
## mod2 6 35159.12
## mod3 7 35156.01
## mod4 7 35126.81
## mod5 7 35135.52
According to AIC data, mod4 is the best.
BIC(mod1, mod2, mod3, mod4, mod5)
## df BIC
## mod1 9 35207.48
## mod2 6 35194.06
## mod3 7 35196.77
## mod4 7 35167.58
## mod5 7 35176.28
Based on BIC values, mod4 is the best.
We preferred mod2, while Akaike and Bayes information criteria indicate mod4.
We test stepAIC from MASS package in order to check this result:
# Creation of a model with all variables:
mod6 <- lm(Peso~., data = new_data)
# Procedura stepAIC
stepwise.mod <- MASS :: stepAIC(mod6, direction="both", k=log(n))
## Start: AIC=28118.61
## Peso ~ Anni.madre + N.gravidanze + Fumatrici + Gestazione + Lunghezza +
## Cranio + Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Anni.madre 1 36710 186779904 28111
## - Fumatrici 1 90677 186833870 28112
## - Ospedale 2 687555 187430749 28112
## - N.gravidanze 1 446244 187189438 28117
## - Tipo.parto 1 451073 187194266 28117
## <none> 186743194 28119
## - Sesso 1 3610705 190353899 28159
## - Gestazione 1 5458852 192202046 28183
## - Cranio 1 45318506 232061700 28654
## - Lunghezza 1 87861708 274604902 29074
##
## Step: AIC=28111.28
## Peso ~ N.gravidanze + Fumatrici + Gestazione + Lunghezza + Cranio +
## Tipo.parto + Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Fumatrici 1 91599 186871503 28105
## - Ospedale 2 693914 187473818 28105
## - Tipo.parto 1 452049 187231953 28110
## <none> 186779904 28111
## - N.gravidanze 1 631082 187410986 28112
## + Anni.madre 1 36710 186743194 28119
## - Sesso 1 3617809 190397713 28151
## - Gestazione 1 5424800 192204704 28175
## - Cranio 1 45569477 232349381 28649
## - Lunghezza 1 87852027 274631931 29066
##
## Step: AIC=28104.68
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Ospedale + Sesso
##
## Df Sum of Sq RSS AIC
## - Ospedale 2 702925 187574428 28098
## - Tipo.parto 1 444404 187315907 28103
## <none> 186871503 28105
## - N.gravidanze 1 608136 187479640 28105
## + Fumatrici 1 91599 186779904 28111
## + Anni.madre 1 37633 186833870 28112
## - Sesso 1 3601860 190473363 28145
## - Gestazione 1 5358199 192229702 28168
## - Cranio 1 45613331 232484834 28642
## - Lunghezza 1 88259386 275130889 29063
##
## Step: AIC=28098.41
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Tipo.parto +
## Sesso
##
## Df Sum of Sq RSS AIC
## - Tipo.parto 1 467626 188042054 28097
## <none> 187574428 28098
## - N.gravidanze 1 648873 188223301 28099
## + Ospedale 2 702925 186871503 28105
## + Fumatrici 1 100610 187473818 28105
## + Anni.madre 1 44184 187530244 28106
## - Sesso 1 3644818 191219246 28139
## - Gestazione 1 5457887 193032315 28162
## - Cranio 1 45747094 233321522 28636
## - Lunghezza 1 87955701 275530129 29051
##
## Step: AIC=28096.81
## Peso ~ N.gravidanze + Gestazione + Lunghezza + Cranio + Sesso
##
## Df Sum of Sq RSS AIC
## <none> 188042054 28097
## - N.gravidanze 1 621053 188663107 28097
## + Tipo.parto 1 467626 187574428 28098
## + Ospedale 2 726146 187315907 28103
## + Fumatrici 1 92548 187949505 28103
## + Anni.madre 1 45366 187996688 28104
## - Sesso 1 3650790 191692844 28137
## - Gestazione 1 5477493 193519547 28161
## - Cranio 1 46098547 234140601 28637
## - Lunghezza 1 87532691 275574744 29044
The model as the best one following this procedure would be the equivalent of the mod2, but keeping the number of previous pregnancies as relevant data. As we have seen, this is mainly due to an only one observation with 7 previous pregnancies. So we decide to keep as the best one our mod2.
We consider mod2 as our final and best regression model.
RMSE_2 <- sqrt(mean(mod2$residuals^2))
print(RMSE_2)
## [1] 274.8193
RMSE measures the average difference between values predicted by the model and the actual values. So using mod2, the average gap between predicted value and reel one is nearly 275g on the variable weight.
par(mfrow = c(2,2))
plot(mod2)
On the previous Residuals vs Leverage plot, the 1551 point presents a Cook distance of 1.
Let’s see in details the data corresponding to this observation:
new_data[1549, ]
## Anni.madre N.gravidanze Fumatrici Gestazione Peso Lunghezza Cranio
## 1551 35 1 0 38 4370 315 374
## Tipo.parto Ospedale Sesso
## 1551 Nat osp3 F
It corresponds to a female newborn from a non-smoker mother, who already had a pregnancy, with a gestation of 38 weeks. Her anthropomorphic characteristics seem abnormal: considering the statistics of this database, the lenght is very short (315 mm against a median of 500 mm), the cranium diameter is pretty high (374 mm against a mean of 340 mm), and the weight is high too (4370 g against a mean of 3284 g).
We decide to remove this value, as it is placed on the 1 “alert” line.
new_data <- new_data[-c(1549),]
mean(residuals(mod2))
## [1] 1.72925e-14
As requested for linear regression model, residuals mean is equal to 0.
library(lmtest)
shapiro.test(residuals(mod2))
##
## Shapiro-Wilk normality test
##
## data: residuals(mod2)
## W = 0.97426, p-value < 2.2e-16
p-value is close to 0: residuals don’t follow a normal distribution.
Looking at the Q-Q plot, extreme values indeed are out of the bisector plot.
bptest(mod2)
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 89.198, df = 4, p-value < 2.2e-16
As p-value is very low, we reject H0, homoscedasticity hypothesis, even if on the plot, the points seem to be casually placed around an horizontal value.
dwtest(mod2)
##
## Durbin-Watson test
##
## data: mod2
## DW = 1.9553, p-value = 0.1318
## alternative hypothesis: true autocorrelation is greater than 0
As p-value is equals to 13%, we don’t reject H0: residuals are not correlated.
library(car)
vif(mod2)
## Gestazione Lunghezza Cranio Sesso
## 1.654101 2.070582 1.606316 1.038918
As required to validate a model, all VIF (Variance Inflation Factors) have values below 5, what signify a non multicollinearity between variables.
Weight of a female newborn, from a mother with 2 previous pregnancies, and a gestation of 39 weeks.
predict(mod2,
newdata = data.frame(
Sesso="F",
Gestazione=39,
Lunghezza=mean(Lunghezza),
Cranio=mean(Cranio)))
## 1
## 3245.462
The predicted weight is equal to 3245 g.
ggplot(data=new_data)+
geom_point(aes( x = Gestazione, y = Peso, col = Sesso), position = "jitter", size = 0.5)+
geom_smooth(aes( x = Gestazione, y = Peso, col = Sesso), se = F, method = "lm")+
labs(title = "Weight as function of weeks of gestation for male and female",
x = "Gestation, in weeks",
y= "Weight, in g")+
theme_light()
Weight newborn increases with weeks of gestation. As male weight is higher as female’s, the curve for male is a little higher than female’s one, but the slope is the same for both.
ggplot(data=new_data)+
geom_point(aes( x = Gestazione, y = Peso, col = factor(Fumatrici)), position = "jitter", size = 0.5)+
geom_smooth(aes( x = Gestazione, y = Peso, col = factor(Fumatrici)), se = F, method = "lm")+
labs(title = "Weight as function of weeks of gestation for smoking and non- mothers",
x = "Gestation, in weeks",
y= "Weight, in g")+
theme_light()+
scale_color_manual(name = "Legend: ", values=c("aquamarine3", "coral1"), labels = c("Non-smoking", "Smoking"))
There is a lower slope for the data about smoking mothers: for long gestations (beyond 36 weeks), newborn weight gets slightly lower than newborns from non-smoking mothers.
A regression model has been developed in order to explain which variables can affect newborns weight and to predict it. In this study the most useful part is the prediction: it is important to be able to anticipate this data before birth, to organize human and material resources to take care about most fragile newborns, the ones with lower weights.
The most relevant variables are weeks of gestation and sex, and as control variables length and cranium diameter increase with the weight (the baby becomes bigger, so heavier and his measurements grow as well).
As surprising as it may sound, some variables as smoking mother habits don’t influence the weight, as well as mother age.