In this project, we analyze a dataset of births in the United States during 2006. Each record is one birth.
The dataset is used as an example in the book “R in a Nutshell” from O’Reilly Media.
load("C:/Users/Quoc Nguyen/Downloads/births2006.smpl.rda")
df<-births2006.smpl
dim(df)
## [1] 427323 13
str(df)
## 'data.frame': 427323 obs. of 13 variables:
## $ DOB_MM : int 9 2 2 10 7 3 5 4 10 4 ...
## $ DOB_WK : int 1 6 2 5 7 3 2 7 3 4 ...
## $ MAGER : int 25 28 18 21 25 28 33 31 18 24 ...
## $ TBO_REC : int 2 2 2 2 1 3 2 3 1 2 ...
## $ WTGAIN : int NA 26 25 6 36 35 26 25 46 43 ...
## $ SEX : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 2 1 1 2 ...
## $ APGAR5 : int NA 9 9 9 10 8 9 9 9 9 ...
## $ DMEDUC : Factor w/ 18 levels "1 year of college",..: 18 4 18 18 6 18 18 4 18 6 ...
## $ UPREVIS : int 10 10 14 22 15 18 10 19 15 13 ...
## $ ESTGEST : int 99 37 38 38 40 39 38 38 40 40 ...
## $ DMETH_REC: Factor w/ 3 levels "C-section","Unknown",..: 3 3 3 3 3 3 1 1 1 3 ...
## $ DPLURAL : Factor w/ 5 levels "1 Single","2 Twin",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DBWT : int 3800 3625 3650 3045 3827 3090 3430 3204 3227 3459 ...
The births dataset contains 427,323 records and 13 variables.
Each variable is described as follows:
DOB_MM: Month of date of birth
DOB_WK: Day of week of birth
MAGER: Mother’s age
TBO_REC: Total birth order
WTGAIN: Weight gain by mother
SEX: Sex of the child (a factor with levels F and M)
APGAR5: APGAR score
DMEDUC: Mother’s education level
UPREVIS: Number of prenatal visits
ESTGEST: Estimated weeks of gestation
DMETH_REC: Delivery Method
DPLURAL: Plural births; levels include 1 Single, 2 Twin, 3 Triplet, 4 Quadruplet, and 5 Quintuplet or higher
DBWT: Birth weight, in grams
First, we want to know the proportion of male and female babies born during 2006.
prop.table(table(df$SEX))
##
## F M
## 0.4882794 0.5117206
About 51.17% of the babies were male and 48.83% were female.
Next, we look at which month had the most births in 2006.
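The percentages can be derived directly from the proportions. A minimal sketch, re-entering the two printed values by hand (the names `sex_prop` and `sex_pct` are only illustrative and do not appear in the original analysis):

```r
# Proportions copied from the prop.table() output above
sex_prop <- c(F = 0.4882794, M = 0.5117206)

# Convert to percentages rounded to two decimals
sex_pct <- round(100 * sex_prop, 2)
print(sex_pct)
```

On the full data, the same numbers should come from `round(100 * prop.table(table(df$SEX)), 2)`.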
dob_mm<-table(df$DOB_MM)
barplot(dob_mm,main="Months of birth",xlab='Month',ylab="Number of births")
The number of births does not differ much from month to month in 2006. What about the frequency of births by day of the week?
The days are coded as 1: Sunday, 2: Monday, …
dob_week<-table(df$DOB_WK)
barplot(dob_week,main="Day of birth",xlab='Day',ylab="Number of births")
The plot shows that fewer births took place on weekends. This may be because many babies are delivered by cesarean section, and those deliveries are typically scheduled on weekdays rather than weekends. To check this, we tabulate the deliveries by method and day of the week.
a<-table(WK=df$DOB_WK,MM=df$DMETH_REC)
print(a)
## MM
## WK C-section Unknown Vaginal
## 1 8836 90 31348
## 2 20454 272 42031
## 3 22921 247 46607
## 4 23103 252 46935
## 5 22825 258 47081
## 6 23233 289 44858
## 7 10696 109 34878
We drop the Unknown column so that only the two known delivery methods are compared.
a<-a[,-2]
print(a)
## MM
## WK C-section Vaginal
## 1 8836 31348
## 2 20454 42031
## 3 22921 46607
## 4 23103 46935
## 5 22825 47081
## 6 23233 44858
## 7 10696 34878
a<-data.frame(a)
library(ggplot2)
ggplot(a,aes(x=Freq,y=WK,fill=MM)) + geom_bar(position="dodge",stat='identity') + labs(title="Deliveries by method and day of the week")
Vaginal deliveries outnumber C-sections on every day of the week, and both delivery methods are more frequent on weekdays than on weekends. One possible explanation is that hospitals prefer to schedule deliveries on weekdays, when enough staff is available.
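To see whether the weekend drop is specific to C-sections, the counts can be turned into within-day shares. A minimal sketch that re-enters the printed counts by hand (the names `counts` and `share` are illustrative, not part of the original analysis):

```r
# Counts copied from the printed table (rows: day of week, 1 = Sunday)
counts <- matrix(
  c(8836, 20454, 22921, 23103, 22825, 23233, 10696,    # C-section
    31348, 42031, 46607, 46935, 47081, 44858, 34878),  # Vaginal
  ncol = 2,
  dimnames = list(WK = as.character(1:7), MM = c("C-section", "Vaginal"))
)

# Row-wise proportions: share of each delivery method within a day
share <- prop.table(counts, margin = 1)
print(round(share, 3))
```

The C-section share is about 22% on Sunday and about 33% on Wednesday, which is consistent with scheduled C-sections being concentrated on weekdays.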
Next, we look at the distribution of plural births and at whether plurality is related to the birth weight of the babies.
table(df$DPLURAL)
##
## 1 Single 2 Twin 3 Triplet
## 412979 13658 642
## 4 Quadruplet 5 Quintuplet or higher
## 39 5
As expected, the large majority of births are single births, followed by twins. Quadruplets (39 records) and quintuplets or higher (5 records) are extremely rare.
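The table gives counts rather than proportions. A minimal sketch that converts the printed counts to percentages (the names `plural` and `plural_pct` are illustrative):

```r
# Counts copied from the table(df$DPLURAL) output
plural <- c("1 Single" = 412979, "2 Twin" = 13658, "3 Triplet" = 642,
            "4 Quadruplet" = 39, "5 Quintuplet or higher" = 5)

# Share of each category, in percent
plural_pct <- 100 * plural / sum(plural)
print(round(plural_pct, 3))
```

On the full data, `prop.table(table(df$DPLURAL))` should give the same shares as fractions.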
According to some medical reports, plural births (more than one baby) tend to have a shorter gestation. We will check whether this holds in the 2006 data.
ggplot(df,aes(x=ESTGEST,y=DPLURAL,fill=DPLURAL)) + geom_boxplot() + labs(title = "Estimated weeks of gestation",x='Number of gestation weeks',y='Plural births')
The boxplot suggests that gestation gets shorter as plurality increases: single births have the longest gestation, while the other groups are shifted toward fewer weeks.
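One caveat for the gestation plot: the first record in the `str()` output has `ESTGEST` equal to 99, which is not a plausible number of weeks. The sketch below recodes such values before summarising, under the assumption (not confirmed in this write-up) that 99 is a code for an unknown value. The data frame `toy` is hypothetical and its values are made up.

```r
# Toy data standing in for df; values are made up for illustration
toy <- data.frame(
  ESTGEST = c(99, 37, 38, 40, 35, 99),
  DPLURAL = c("1 Single", "1 Single", "1 Single", "1 Single", "2 Twin", "2 Twin")
)

# Assumption: 99 marks an unknown gestation length, so treat it as missing
toy$ESTGEST[toy$ESTGEST == 99] <- NA

# Median gestation per group, ignoring missing values
med <- tapply(toy$ESTGEST, toy$DPLURAL, median, na.rm = TRUE)
print(med)
```

If the assumption holds for the real data, applying the same recoding to `df$ESTGEST` before plotting would keep the code 99 from appearing as an outlier in each box.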
Next, we check whether birth weight varies with plurality or delivery method.
#Density plot
ggplot(df,aes(x=DBWT,fill=DPLURAL))+ geom_density() + facet_grid(rows = vars(DPLURAL))
#Average birth weight of each plural birth
tapply(df$DBWT,df$DPLURAL,mean,na.rm=TRUE)
## 1 Single 2 Twin 3 Triplet
## 3298.263 2327.478 1677.017
## 4 Quadruplet 5 Quintuplet or higher
## 1196.105 1142.800
#Boxplot
ggplot(df,aes(x=DBWT,y=DPLURAL,fill=SEX))+ geom_boxplot()
#Average birth weight of each plural birth according to gender
tapply(df$DBWT,list(df$DPLURAL,df$SEX),mean,na.rm=TRUE)
## F M
## 1 Single 3242.302 3351.637
## 2 Twin 2279.508 2373.819
## 3 Triplet 1697.822 1655.348
## 4 Quadruplet 1319.556 1085.000
## 5 Quintuplet or higher 1007.667 1345.500
The plots and the means show that birth weight decreases as plurality increases, which is consistent with the shorter gestation of plural births. Among single births and twins, girls weigh less than boys on average; for triplets and above the pattern is not consistent, and those groups are small.
#Boxplot
ggplot(df,aes(x=DBWT,y=DMETH_REC,fill=DMETH_REC))+ geom_boxplot()
The delivery method, by contrast, does not appear to be related to the birth weight of babies.
Some research suggests that women who gain more weight in early pregnancy are more likely to deliver unusually large babies, who may be prone to a host of health problems later in life.
ggplot(df,aes(x=DBWT,y=WTGAIN))+ geom_point()
The relationship between birth weight and the weight gained by the mother is not strong, but there appears to be a positive correlation: the more weight the mother gained, the heavier the baby tends to be.
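A scatter plot of this many points is hard to read, so a correlation coefficient can help quantify the strength of the relationship. The sketch below uses simulated data as a stand-in for `df` (the numbers are invented and are not results from the births data) and shows how missing values, such as the `NA` entries in `WTGAIN`, can be handled:

```r
set.seed(1)

# Simulated stand-in for df: weight gain and a weakly related birth weight
n <- 500
WTGAIN <- rnorm(n, mean = 30, sd = 12)
DBWT <- 3000 + 8 * WTGAIN + rnorm(n, sd = 500)
DBWT[sample(n, 20)] <- NA  # mimic missing values

# Pearson correlation using only rows where both values are present
r <- cor(DBWT, WTGAIN, use = "complete.obs")
print(r)
```

On the real data, the equivalent call would be `cor(df$DBWT, df$WTGAIN, use = "complete.obs")`.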
We also want to understand the relationship between the APGAR score and the birth weight. The APGAR score indicates the health status of a newborn. Lower APGAR means the newborn has difficulties.
ggplot(df,aes(y=DBWT,x=as.factor(APGAR5),fill=as.factor(APGAR5)))+ geom_boxplot()
In general, newborns with lower APGAR scores tend to have lower birth weights.
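The boxplot can be backed up with a per-score summary, in the same way `tapply` was used for plural births. A minimal sketch on a hypothetical toy data frame (the values are made up and only illustrate the call):

```r
# Toy data standing in for df; values are made up for illustration
toy <- data.frame(
  APGAR5 = c(3, 3, 8, 8, 9, 9, 9, NA),
  DBWT   = c(1500, 1900, 3100, 3300, 3400, 3500, 3600, 3200)
)

# Median birth weight per APGAR score; rows with a missing score are
# dropped because tapply() ignores NA values in the grouping variable
med_by_apgar <- tapply(toy$DBWT, toy$APGAR5, median, na.rm = TRUE)
print(med_by_apgar)
```

On the real data, the equivalent call would be `tapply(df$DBWT, df$APGAR5, median, na.rm = TRUE)`.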