North Carolina Birth Records in 2001

Child birth Analysis

Aditi Sharma- S3795604

Last updated: 27 October, 2019

Introduction

Problem Statement

We examine how a child’s birth weight relates to the mother’s age, which is categorised as “below 30” or “30 and above”.

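Our working claim is that mothers aged 30 and above are more likely to give birth to a healthy child, with birth weight as the measure of health. The null hypothesis is that mean birth weight is the same in both age groups; the alternative is that it differs.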

In the rest of this assignment we apply several tests to assess this claim.

Data

https://vincentarelbundock.github.io/Rdatasets/doc/Stat2Data/NCbirths.html

Data: NCbirths.csv Variables: 16 Observations: 1450

This dataset contains data on a sample of 1450 birth records that statistician John Holcomb selected from the North Carolina State Center for Health and Environmental Statistics.

Data Cont.

Descriptive Statistics and Visualisation

Description of variables

• ID: Patient ID code
• Plural: 1=single birth, 2=twins, 3=triplets
• Sex: Sex of the baby, 1=male, 2=female
• MomAge: Mother’s age (in years)
• Weeks: Completed weeks of gestation
• Marital (Marital status): 1=married or 2=not married
• RaceMom (Mother’s race): 1=white, 2=black, 3=American Indian, 4=Chinese, 5=Japanese, 6=Hawaiian, 7=Filipino, or 8=Other Asian or Pacific Islander
• HispMom (Hispanic origin of mother): C=Cuban, M=Mexican, N=not Hispanic, O=Other Hispanic, P=Puerto Rico, S=Central/South America
• Gained: Weight gained during pregnancy (in pounds)
• Smoke (Smoker mom?): 1=yes or 0=no
• BirthWeightOz: Birth weight in ounces
• BirthWeightGm: Birth weight in grams
• Low (Indicator for low birth weight): 1=2500 grams or less
• Premie (Indicator for premature birth): 1=36 weeks or sooner
• MomRace (Mother’s race): black, hispanic, other, or white
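The str() output further down shows an Age.bracket column that this variable list does not describe. Below is a minimal sketch of how such a column could be derived from MomAge. It assumes that a mother aged exactly 30 falls in the “Above 30” group, since the report does not state the cut-off rule, and `toy` is a made-up data frame used only for illustration.

```r
# Hypothetical ages, not taken from NCbirths.csv
toy <- data.frame(MomAge = c(32, 27, 30, 15))

# Assumption: ages of 30 and over are labelled "Above 30"
toy$Age.bracket <- factor(ifelse(toy$MomAge >= 30, "Above 30", "Less than 30"))

# Factor levels and group counts
levels(toy$Age.bracket)
table(toy$Age.bracket)
```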

– In this section we calculate descriptive statistics (mean, median, standard deviation, first and third quartiles, interquartile range, minimum and maximum values).

– Less than 3% of the data were missing, which is small enough to exclude safely; we use na.rm=TRUE from the next section onwards.

– We checked for outliers univariately, using a box plot.
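A box plot marks as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand to made-up weights (the vector `x` is hypothetical, not drawn from NCbirths.csv). Note that boxplot() computes its quartiles as hinges, which can differ slightly from quantile() for some sample sizes.

```r
# Made-up birth weights in grams, for illustration only
x <- c(340, 2950, 3100, 3250, 3300, 3400, 3550, 3700, 5900)

# Quartiles and interquartile range
q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
outliers
```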

getwd()
## [1] "C:/Users/aditi/OneDrive/Documents/Course/Intro to Statistics/Assignment 3"
#Load dplyr, which provides %>%, filter(), group_by() and summarise()
library(dplyr)
ncb <- read.csv("NCbirths.csv", stringsAsFactors = FALSE)


#Determining the structure of the data frame
str(ncb)
## 'data.frame':    1450 obs. of  16 variables:
##  $ ID           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Plural       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex          : int  1 2 1 1 1 1 2 2 2 2 ...
##  $ MomAge       : int  32 32 27 27 25 28 25 15 21 27 ...
##  $ Age.bracket  : chr  "Above 30" "Above 30" "Less than 30" "Less than 30" ...
##  $ Weeks        : int  40 37 39 39 39 43 39 42 39 40 ...
##  $ Marital      : int  1 1 1 1 1 1 1 2 1 2 ...
##  $ RaceMom      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ HispMom      : chr  "N" "N" "N" "N" ...
##  $ Gained       : int  38 34 12 15 32 32 75 25 28 37 ...
##  $ Smoke        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BirthWeightOz: int  111 116 138 136 121 117 143 113 120 124 ...
##  $ BirthWeightGm: num  3147 3289 3912 3856 3430 ...
##  $ Low          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Premie       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MomRace      : chr  "white" "white" "white" "white" ...
#determining the class of data frame
class(ncb)
## [1] "data.frame"
#Convert Age.bracket to a factor
ncb$Age.bracket <- as.factor(ncb$Age.bracket) 



#Determining class of factor variable
class(ncb$Age.bracket)
## [1] "factor"
#To determine levels of the factor variable
levels(ncb$Age.bracket)
## [1] "Above 30"     "Less than 30"
#Total number of missing values in the data
sum(is.na(ncb))  
## [1] 46
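The total of 46 is counted over all cells of the data frame; against 1450 rows and 16 columns that is roughly 0.2% of cells, consistent with the "less than 3%" figure above. The following sketch shows how the count could be broken down per column and expressed as a percentage; `toy` is a made-up data frame, not the real data.

```r
# Made-up data frame with a few missing values, for illustration only
toy <- data.frame(
  Gained = c(38, NA, 12, 15),
  Smoke  = c(0, 0, NA, NA),
  Weeks  = c(40, 37, 39, 39)
)

# Missing values per column
na_per_column <- colSums(is.na(toy))
na_per_column

# Overall share of missing cells, as a percentage
na_percent <- 100 * mean(is.na(toy))
na_percent
```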
#Determining outliers
ncb$BirthWeightGm %>%  boxplot(main="Box Plot of NCB", ylab="Birth weight", col = "grey")

#First rows of the data frame
head(ncb)

Descriptive Statistics Cont.

– Using these data, we compare birth weight, as our measure of how healthy a child was at birth, between mothers in the “Above 30” and “Less than 30” age groups.

above30 <- ncb %>% filter(Age.bracket == "Above 30")


#Summary Stats for both age groups

tablencb <- ncb %>% group_by(Age.bracket) %>% summarise(Min = min(BirthWeightGm,na.rm = TRUE),
                                           Q1 = quantile(BirthWeightGm,probs = .25,na.rm = TRUE),
                                           Median = median(BirthWeightGm, na.rm = TRUE),
                                           Q3 = quantile(BirthWeightGm,probs = .75,na.rm = TRUE),
                                           Max = max(BirthWeightGm,na.rm = TRUE),
                                           Mean = mean(BirthWeightGm, na.rm = TRUE),
                                           SD = sd(BirthWeightGm, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(BirthWeightGm)))



#Stats in detail
tablencb
#Distribution Fitting

graph <- hist(above30$BirthWeightGm, xlab = "Child weight for mothers above 30", ylab = "Frequency",
              main = "Histogram of empirical dist. for child weight, mothers above 30", col = "skyblue", breaks = 15, labels = TRUE)

#Stats for 30 and above

above30$BirthWeightGm %>% summary () 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     652    3062    3402    3372    3771    4791
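#Normal density with the sample mean and SD, evaluated over the range of birth weights in grams
#(evaluating it at 1:12 gives densities that are effectively zero)
x_above <- seq(500, 5500, by = 100)
plot(x_above, dnorm(x_above, mean = 3372, sd = 609.04), type = "p", xlab = "Birth weight (g)", ylab = "Density", main = "Normal Distribution child birth for above 30 age group", col = "red", lwd = 2)
lines(x_above, dnorm(x_above, mean = 3372, sd = 609.04), type = "h", col = "blue", lwd = 3)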

#Tests to check whether the data follow a normal distribution


qqnorm(above30$BirthWeightGm, main = "Q-Q plot for above 30", col ="blue")

shapiro.test(above30$BirthWeightGm)
## 
##  Shapiro-Wilk normality test
## 
## data:  above30$BirthWeightGm
## W = 0.96772, p-value = 4.157e-09
#Testing child weight for mothers below 30



below30 <- ncb %>% filter(Age.bracket=="Less than 30")


graph <- hist(below30$BirthWeightGm, xlab = "Child weight for mothers below 30", ylab = "Frequency",
              main = "Histogram of child weight for mothers below 30", col = "green", breaks = 15, labels = TRUE)

below30$BirthWeightGm %>% summary() 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   340.2  2976.8  3316.9  3254.9  3628.8  5131.4
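#Normal density with the sample mean and SD, evaluated over the range of birth weights in grams
#(evaluating it at 1:12 gives densities that are effectively zero)
x_below <- seq(0, 5500, by = 100)
plot(x_below, dnorm(x_below, mean = 3255, sd = 642.1), type = "p", xlab = "Birth weight (g)", ylab = "Density", main = "Normal Distribution for below 30 mothers", col = "red", lwd = 2)
lines(x_below, dnorm(x_below, mean = 3255, sd = 642.1), type = "h", col = "green", lwd = 3)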

#Tests to check whether the data follow a normal distribution


qqnorm(below30$BirthWeightGm, main = "Q-Q plot for below 30", col ="green")

shapiro.test(below30$BirthWeightGm)
## 
##  Shapiro-Wilk normality test
## 
## data:  below30$BirthWeightGm
## W = 0.92603, p-value < 2.2e-16
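Both Shapiro-Wilk p-values in this report are far below 0.05, so normality is rejected for both age groups. With groups of 506 and 944 observations, a t-test on the means is commonly still justified by the central limit theorem. The sketch below shows how the p-value can be read against a significance level. The two vectors are deterministic illustrative data and `alpha` is an assumed threshold; none of it comes from NCbirths.csv.

```r
# Deterministic illustrative data, not from NCbirths.csv
near_normal <- qnorm(ppoints(100))             # evenly spaced normal quantiles
skewed      <- exp(seq(0, 5, length.out = 100)) # strongly right-skewed values

# Shapiro-Wilk p-values for each vector
p_normal <- shapiro.test(near_normal)$p.value
p_skewed <- shapiro.test(skewed)$p.value

alpha <- 0.05  # assumed significance level

p_normal < alpha  # FALSE: no evidence against normality
p_skewed < alpha  # TRUE: normality is rejected
```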
knitr::kable(tablencb)
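|Age.bracket  |    Min|      Q1|  Median|      Q3|     Max|     Mean|       SD|   n| Missing|
|:------------|------:|-------:|-------:|-------:|-------:|--------:|--------:|---:|-------:|
|Above 30     | 652.05| 3061.80| 3402.00| 3770.55| 4791.15| 3371.521| 609.0403| 506|       0|
|Less than 30 | 340.20| 2976.75| 3316.95| 3628.80| 5131.35| 3254.934| 642.0728| 944|       0|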

Hypothesis Testing

(mean(above30$BirthWeightGm)-mean(below30$BirthWeightGm))%>%round(2)
## [1] 116.59
t.test (ncb$BirthWeightGm ~ ncb$Age.bracket, data = ncb, var.equal = TRUE, alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  ncb$BirthWeightGm by ncb$Age.bracket
## t = 3.3548, df = 1448, p-value = 0.0008147
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   48.4170 184.7561
## sample estimates:
##     mean in group Above 30 mean in group Less than 30 
##                   3371.521                   3254.934
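As a cross-check, the test statistic, p-value and confidence interval above can be recomputed from the summary statistics in the kable table using the pooled-variance formula that var.equal = TRUE implies. This is only a sketch for verification: the variable names are illustrative, and the inputs are the rounded table values, so the last digits may differ slightly from the t.test() output.

```r
# Summary statistics copied from the kable table above
n1 <- 506; m1 <- 3371.521; s1 <- 609.0403   # Above 30
n2 <- 944; m2 <- 3254.934; s2 <- 642.0728   # Less than 30

# Pooled standard deviation and standard error of the difference in means
df <- n1 + n2 - 2
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)
se <- sp * sqrt(1 / n1 + 1 / n2)

# Two-sided test statistic, p-value and 95% confidence interval
t_stat  <- (m1 - m2) / se
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- (m1 - m2) + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value), 4)
round(ci, 2)
```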

Discussion

Strengths of Investigation:

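  1. Missing data and outliers were detected and were manageable.

  2. The data set was easy to work with, which made it straightforward to run the tests needed to assess our hypothesis.

  3. The test requires a sample from each of the two populations, and for it to be valid the samples must be random and independent. We ran a two-sample t-test with var.equal = TRUE (the pooled-variance test, not Welch’s) and obtained t = 3.3548, df = 1448, p-value = 0.0008147. Since the p-value is below 0.05, we reject the null hypothesis of equal mean birth weights: babies of mothers in the “Above 30” group were on average 116.59 g heavier (95% confidence interval 48.42 to 184.76 g).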
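The t.test() call in the Hypothesis Testing section sets var.equal = TRUE, which selects the pooled-variance test; leaving the argument at its default of FALSE gives Welch's test instead. Below is a minimal sketch of the difference on simulated data. The data frame `toy`, its group sizes and its parameters are hypothetical.

```r
set.seed(1)  # reproducible simulated data, not from NCbirths.csv
toy <- data.frame(
  weight = c(rnorm(20, mean = 3400, sd = 600), rnorm(30, mean = 3250, sd = 640)),
  group  = rep(c("Above 30", "Less than 30"), times = c(20, 30))
)

# Pooled-variance test: assumes equal variances, df = n1 + n2 - 2
pooled <- t.test(weight ~ group, data = toy, var.equal = TRUE)

# Welch test: no equal-variance assumption, df from the Welch-Satterthwaite approximation
welch <- t.test(weight ~ group, data = toy, var.equal = FALSE)

pooled$method
pooled$parameter
welch$method
welch$parameter
```

When the two group variances are similar, as the SDs of 609.04 and 642.07 in this report suggest, the two tests usually give close results.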


References