library(readxl)
library(data.table)
library(knitr)
library(tidyverse)
library(stringr)
library(flextable)
neonatal <- read_excel("neonatal.xlsx")
format(head(neonatal))
## [1] "# A tibble: 6 × 7"
## [2] " studyid bmicat birthweight lga gesta…¹ hypog…² admis…³"
## [3] " <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> "
## [4] "1 B4001 30-34.9(obese) 3420 normalsize 40.2 No No "
## [5] "2 B4002 30-34.9(obese) 2955 normalsize 38.1 No No "
## [6] "3 B4003 35-39.9 (severly obese) 5054 LGA 42 No No "
## [7] "4 B4004 30-34.9(obese) 3455 normalsize 39.6 No No "
## [8] "5 B4005 35-39.9 (severly obese) 4510 LGA 39.6 No No "
## [9] "6 B4006 30-34.9(obese) 3210 normalsize 37.3 No Yes "
## [10] "# … with abbreviated variable names ¹gestationalage, ²hypoglycaemia, ³admission"
set_flextable_defaults(
font.size = 10, font.family = "Helvetica",
font.color = "#333333",
table.layout = "fixed",
border.color = "gray",
padding.top = 3, padding.bottom = 3,
padding.left = 4, padding.right = 4)
number of observations are given by which is total size of the population
nrow(neonatal)
## [1] 2466
format(glimpse(neonatal))
## Rows: 2,466
## Columns: 7
## $ studyid <chr> "B4001", "B4002", "B4003", "B4004", "B4005", "B4006", "…
## $ bmicat <chr> "30-34.9(obese)", "30-34.9(obese)", "35-39.9 (severly o…
## $ birthweight <dbl> 3420, 2955, 5054, 3455, 4510, 3210, 3385, 3110, 3320, 3…
## $ lga <chr> "normalsize", "normalsize", "LGA", "normalsize", "LGA",…
## $ gestationalage <dbl> 40.2, 38.1, 42.0, 39.6, 39.6, 37.3, 40.0, 37.2, 40.5, 4…
## $ hypoglycaemia <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
## $ admission <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", …
## [1] "# A tibble: 2,466 × 7"
## [2] " studyid bmicat birthweight lga gesta…¹ hypog…² admis…³"
## [3] " <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> "
## [4] " 1 B4001 30-34.9(obese) 3420 normalsi… 40.2 No No "
## [5] " 2 B4002 30-34.9(obese) 2955 normalsi… 38.1 No No "
## [6] " 3 B4003 35-39.9 (severly obese) 5054 LGA 42 No No "
## [7] " 4 B4004 30-34.9(obese) 3455 normalsi… 39.6 No No "
## [8] " 5 B4005 35-39.9 (severly obese) 4510 LGA 39.6 No No "
## [9] " 6 B4006 30-34.9(obese) 3210 normalsi… 37.3 No Yes "
## [10] " 7 B4007 30-34.9(obese) 3385 normalsi… 40 No No "
## [11] " 8 B4008 30-34.9(obese) 3110 normalsi… 37.2 No No "
## [12] " 9 B4009 30-34.9(obese) 3320 normalsi… 40.5 No No "
## [13] "10 B4010 30-34.9(obese) 3418 normalsi… 40.5 No No "
## [14] "# … with 2,456 more rows, and abbreviated variable names ¹gestationalage,"
## [15] "# ²hypoglycaemia, ³admission"
so all the columns are character variables except the birthweight We need to characterize the data accordinf to three groups in bmicat which can be done bu diving the bmi column and filtering rows with condition.
unique(neonatal$bmicat)
## [1] "30-34.9(obese)" "35-39.9 (severly obese)"
## [3] ">=40 (morbidly obese)"
There are three unique values which means we have 3 groups in the bmicat column. Now we are going to classify the three groups.
obese <- neonatal %>% filter(bmicat %in% c("30-34.9(obese)"))
severely_obese <- neonatal %>% filter(bmicat %in% c("35-39.9 (severly obese)"))
morbidly_obese <- neonatal %>% filter(bmicat %in% c(">=40 (morbidly obese)"))
For the class of obese patients, we need to find the number of patients who were admitted to the NICU at birth and their mean and standard deviation. In the below chunk we are calculating the median and the IQR of the birthweight of the class of all 3 groups of patients as well as the mean median of all groups in the form of a table
#
df1 <- data.frame(
"Groups" = c("Obese", "Severly Obese", "Morbidly Obese","All"),
c(mean(obese$birthweight),mean(severely_obese$birthweight),mean(morbidly_obese$birthweight),mean(neonatal$birthweight)),
c(sd(obese$birthweight),sd(severely_obese$birthweight),sd(morbidly_obese$birthweight),sd(neonatal$birthweight)),
c(IQR(obese$gestationalage), IQR(severely_obese$gestationalage), IQR(morbidly_obese$gestationalage),IQR(neonatal$gestationalage)),
c(median(obese$gestationalage), median(severely_obese$gestationalage), median(morbidly_obese$gestationalage),median(neonatal$gestationalage)))
df1 <- setNames(df1,c("Groups","mean","SD","IQR","median"))
kable(df1)
| Groups | mean | SD | IQR | median |
|---|---|---|---|---|
| Obese | 3447.420 | 652.9939 | 1.9 | 39.3 |
| Severly Obese | 3445.350 | 658.7284 | 1.8 | 39.2 |
| Morbidly Obese | 3450.111 | 783.3571 | 1.9 | 39.1 |
| All | 3447.216 | 666.9211 | 1.9 | 39.3 |
With regards to the LGA we can calculate the percentage by classifying into 2 groups. before hand we can identify if the groups are more than 2 in LGA by using unique command.
unique(neonatal$lga)
## [1] "normalsize" "LGA"
df2 <- data.frame(
"Groups" = c("Obese", "Severly Obese", "Morbidly Obese","All"),
c(obese %>% filter(lga %in% c("normalsize")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(lga %in% c("normalsize")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(lga %in% c("normalsize")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(lga %in% c("normalsize")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(lga %in% c("LGA")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(lga %in% c("LGA")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(lga %in% c("LGA")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(lga %in% c("LGA")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(lga %in% c("normalsize")) %>% nrow(),
severely_obese %>% filter(lga %in% c("normalsize")) %>% nrow(),
morbidly_obese %>% filter(lga %in% c("normalsize")) %>% nrow(),
neonatal %>% filter(lga %in% c("normalsize")) %>% nrow()),
c(obese %>% filter(lga %in% c("LGA")) %>% nrow(),
severely_obese %>% filter(lga %in% c("LGA")) %>% nrow(),
morbidly_obese %>% filter(lga %in% c("LGA")) %>% nrow(),
neonatal %>% filter(lga %in% c("LGA")) %>% nrow())
)
df2 <- setNames(df2,c("Groups","normalsize %","lga %","number of normalsize","number of lga"))
flextable(df2)
Groups | normalsize % | lga % | number of normalsize | number of lga |
|---|---|---|---|---|
Obese | 73.63476 | 26.36524 | 1,254 | 449 |
Severly Obese | 71.88082 | 28.11918 | 386 | 151 |
Morbidly Obese | 67.25664 | 32.74336 | 152 | 74 |
All | 72.66829 | 27.33171 | 1,792 | 674 |
Now for the Hypoglycaemia we can calculate the percentage.
df3 <- data.frame(
"Groups" = c("Obese", "Severly Obese", "Morbidly Obese","All"),
c(obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(hypoglycaemia %in% c("No")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow(),
severely_obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow(),
morbidly_obese %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow(),
neonatal %>% filter(hypoglycaemia %in% c("Yes")) %>% nrow()),
c(obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow(),
severely_obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow(),
morbidly_obese %>% filter(hypoglycaemia %in% c("No")) %>% nrow(),
neonatal %>% filter(hypoglycaemia %in% c("No")) %>% nrow())
)
df3 <- setNames(df3,c("Groups","yes %","no %","number for yes","number for no"))
kable(df3)
| Groups | yes % | no % | number for yes | number for no |
|---|---|---|---|---|
| Obese | 7.516148 | 92.48385 | 128 | 1575 |
| Severly Obese | 6.703911 | 93.29609 | 36 | 501 |
| Morbidly Obese | 10.619469 | 89.38053 | 24 | 202 |
| All | 7.623682 | 92.37632 | 188 | 2278 |
For the NICU admission at birth we can calculate it’s percentage by the code below
df4 <- data.frame(
"Groups" = c("Obese", "Severly Obese", "Morbidly Obese","All"),
c(obese %>% filter(admission %in% c("Yes")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(admission %in% c("Yes")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(admission %in% c("Yes")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(admission %in% c("Yes")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(admission %in% c("No")) %>% nrow()/nrow(obese)*100,
severely_obese %>% filter(admission %in% c("No")) %>% nrow()/nrow(severely_obese)*100,
morbidly_obese %>% filter(admission %in% c("No")) %>% nrow()/nrow(morbidly_obese)*100,
neonatal %>% filter(admission %in% c("No")) %>% nrow()/nrow(neonatal)*100),
c(obese %>% filter(admission %in% c("Yes")) %>% nrow(),
severely_obese %>% filter(admission %in% c("Yes")) %>% nrow(),
morbidly_obese %>% filter(admission %in% c("Yes")) %>% nrow(),
neonatal %>% filter(admission %in% c("Yes")) %>% nrow()),
c(obese %>% filter(admission %in% c("No")) %>% nrow(),
severely_obese %>% filter(admission %in% c("No")) %>% nrow(),
morbidly_obese %>% filter(admission %in% c("No")) %>% nrow(),
neonatal %>% filter(admission %in% c("No")) %>% nrow()/nrow(neonatal)*100)
)
df4 <- setNames(df4,c("Groups","yes %","no %","number for yes","number for no"))
kable(df4)
| Groups | yes % | no % | number for yes | number for no |
|---|---|---|---|---|
| Obese | 20.02349 | 79.97651 | 341 | 1362.00000 |
| Severly Obese | 18.99441 | 81.00559 | 102 | 435.00000 |
| Morbidly Obese | 18.58407 | 81.41593 | 42 | 184.00000 |
| All | 19.66748 | 80.33252 | 485 | 80.33252 |
There are 3 groups in here so we can do an 3 groups Analysis of variance test statistical analysis test to see if the groups are different. before doing the test we need to define the hypothesis at 5% significance level which will help us to determine if the groups there exists a significant difference between them.
Null Hypothesis: H0: Birthweight of the newborns is the same in all groups. Alternate Hypothesis: H1: Birthweight of the newborns is different in the groups.
anova <- aov(birthweight~bmicat,data=neonatal)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## bmicat 2 3.834e+03 1917 0.004 0.996
## Residuals 2463 1.096e+09 445143
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = birthweight ~ bmicat, data = neonatal)
##
## $bmicat
## diff lwr upr
## 30-34.9(obese)->=40 (morbidly obese) -2.690185 -113.45959 108.0792
## 35-39.9 (severly obese)->=40 (morbidly obese) -4.760526 -128.82178 119.3007
## 35-39.9 (severly obese)-30-34.9(obese) -2.070341 -79.50668 75.3660
## p adj
## 30-34.9(obese)->=40 (morbidly obese) 0.9982132
## 35-39.9 (severly obese)->=40 (morbidly obese) 0.9955455
## 35-39.9 (severly obese)-30-34.9(obese) 0.9978350
Our result show that p-value is much large than 0.05 so we can reject the Alternate hypothesis. TukeyHSD test is used to determine how much the groups differ from each other. Our results show that p-value for all the 3 groups is very large so we conclude that all of the 3 groups are similar to each other. Moreoever the F-stats value is very small which also supports our hypothesis.
Null hypothesis H0: Gestational age is the same in all groups. Alternate hypothesis H1: Gestational age is different in the groups.
anova <- aov(gestationalage~bmicat,data=neonatal)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## bmicat 2 76 38.04 5.803 0.00306 **
## Residuals 2463 16147 6.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = gestationalage ~ bmicat, data = neonatal)
##
## $bmicat
## diff lwr upr
## 30-34.9(obese)->=40 (morbidly obese) 0.61703605 0.19194624 1.0421259
## 35-39.9 (severly obese)->=40 (morbidly obese) 0.52766599 0.05156726 1.0037647
## 35-39.9 (severly obese)-30-34.9(obese) -0.08937006 -0.38654053 0.2078004
## p adj
## 30-34.9(obese)->=40 (morbidly obese) 0.0019496
## 35-39.9 (severly obese)->=40 (morbidly obese) 0.0254625
## 35-39.9 (severly obese)-30-34.9(obese) 0.7603987
Our results show that p-value is less than 0.05 so we reject our null hypothesis and we can say that the groups are different in geostational age. Note that residuals are also less here as compared to the previous analysis. The asteric next to p-value also shows the significance of the test. Tukeytest here shows the difference between the groups. The p-value for the severaly obese and obese comes out to be more than 0.05 so we can say that there is a high significant difference between these 2 groups as compared to obese and morbidltly obese as well as as the several obese and morbdily obese.
here the data is categorical so we will use the chisquare test. For the test again we define our hypothesis as
Null hypothesis H0: LGA is the same in all groups. Alternate hypothesis H1: LGA is different in the groups.
chisq.test(table(neonatal$lga))
##
## Chi-squared test for given probabilities
##
## data: table(neonatal$lga)
## X-squared = 506.86, df = 1, p-value < 2.2e-16
The results show that p-value is less than 0.05 so we reject our null hypothesis and we can say that the groups are different in LGA. our degrees of freedom is 1 so we can say that the test is independent.
Null hypothesis H0: Hypoglycaemia is the same in all groups. Alternate hypothesis H1: Hypoglycaemia is different in the groups.
chisq.test(table(neonatal$hypoglycaemia))
##
## Chi-squared test for given probabilities
##
## data: table(neonatal$hypoglycaemia)
## X-squared = 1771.3, df = 1, p-value < 2.2e-16
In this case the groups are different in the hypoglycaemia because the p-value is less than 0.05. The degrees of freedom is 1 so we can say that the test is independent. X-squared value is 1771.8 so we can say that the test is not significant. Remember that actual degreess of freedom is 2 but R always takes value 1 less than size of population.
Null hypothesis H0: Admission is the same in all groups. Alternate hypothesis H1: Admission is different in the groups.
chisq.test(table(neonatal$admission))
##
## Chi-squared test for given probabilities
##
## data: table(neonatal$admission)
## X-squared = 907.55, df = 1, p-value < 2.2e-16
Admission is different in the groups. The p-value is less than 0.05. X-squared value of chisquare test is 907.5 so we can say that the test is not significant.
The analysis is based on the following questions: * Is the BMI is siginficaltly different in the groups * What’s the relation between the BMI and neonatal outcomes.
We have performed several statistical tests to determine the
significance of the data such ANOVA, chi-square and Tukey test. The
result for the tests with numerical data column such as birthweight and
geostational have different p-values at 5% significance level with 95%
confidence interval. For the birthweight the p-value is >0.05 which
shows that birthweight for the 3 groups are almost identical. It is also
verified from the mean values of the birth weight where all values are
near to 3450. IQR is also very small so we can say that the birthweight
is not significantly different in the groups. With regards to
geostational age the p-value is <0.05 which shows that the
geostational age is significantly different in the groups. Moreover the
Tukey test show that the two groups of
35-39.9 (severly obese)-30-34.9(obese) are not
siginficaltly different as compared to all other combination of
groups.
There are 3 categorical variables in the dataset which required the chi-squared test of significance. In all the 3 categorical variables the p-value is <0.05 which shows that the categorical variables are significantly different in the groups. Overall analysis shows that LGA, Hypoglycaemia and Admissions for all the 3 groups are significantly different from BMI while there exists some corelation for birthweight and hypoglycaemia.