Aditi Sharma- S3795604
Last updated: 27 October, 2019
This data set concerns births in North Carolina in 2001. Maternal health factors are a leading contributor to birth outcomes such as fetal viability and infant mortality.
In North Carolina, barriers to affordable and consistent healthcare for women pre- and postconception contribute to stubbornly high rates of fetal and infant death each year, despite advances in clinical care.
The data below consist of North Carolina birth records from 2001.
We examine the data and analyse the birth weight of children by the mother’s age, which is categorised as “below 30” and “30 and above”.
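Our research question is whether mothers aged 30 and above tend to give birth to heavier, and in that sense healthier, children. The null hypothesis is that the mean birth weight is the same in both age groups; the alternative hypothesis is that the means differ.
In the rest of the assignment we apply several tests to assess this hypothesis.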
https://vincentarelbundock.github.io/Rdatasets/doc/Stat2Data/NCbirths.html
Data: NCBirths.csv Variables: 16 Observations: 1450
This dataset contains data on a sample of 1450 birth records that statistician John Holcomb selected from the North Carolina State Center for Health and Environmental Statistics.
The column “Age.bracket” is read in as character data; we convert it to a factor and inspect its levels and labels.
The original structure contains no factor variables, so this is the only variable we convert.
Description of variables
• ID: Patient ID code
• Plural: 1 = single birth, 2 = twins, 3 = triplets
• Sex: Sex of the baby, 1 = male, 2 = female
• MomAge: Mother’s age (in years)
• Weeks: Completed weeks of gestation
• Marital: Marital status, 1 = married, 2 = not married
• RaceMom: Mother’s race, 1 = white, 2 = black, 3 = American Indian, 4 = Chinese, 5 = Japanese, 6 = Hawaiian, 7 = Filipino, 8 = Other Asian or Pacific Islander
• HispMom: Hispanic origin of mother, C = Cuban, M = Mexican, N = not Hispanic, O = Other Hispanic, P = Puerto Rico, S = Central/South America
• Gained: Weight gained during pregnancy (in pounds)
• Smoke: Smoker mom? 1 = yes, 0 = no
• BirthWeightOz: Birth weight in ounces
• BirthWeightGm: Birth weight in grams
• Low: Indicator for low birth weight, 1 = 2500 grams or less
• Premie: Indicator for premature birth, 1 = 36 weeks or sooner
• MomRace: Mother’s race (black, hispanic, other, or white)
– In this section, we will calculate descriptive statistics (i.e., mean, median, standard deviation, first and third quartile, interquartile range, minimum and maximum values)
– Less than 3% of the data is missing, a share small enough to exclude safely; missing values are dropped with na.rm = TRUE from the next section onwards.
– We used a univariate approach, a box plot, to detect outliers.
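The missing-data check and the box-plot outlier rule described above can be sketched as follows. This is a minimal illustration on a small hypothetical data frame (`toy` and its values are assumptions made for this example, not the NCbirths records); the same calls would apply to `ncb`.

```r
# Hypothetical toy data for illustration only; not the NCbirths records.
toy <- data.frame(
  BirthWeightGm = c(3147, 3289, 3912, 3856, 3430, 652, NA, 3402),
  Gained        = c(38, 34, 12, 15, 32, NA, 75, 25)
)

# Share of missing cells across the whole data frame
missing_share <- sum(is.na(toy)) / (nrow(toy) * ncol(toy))

# Number of missing values in each column
missing_by_col <- colSums(is.na(toy))

# Box-plot rule: flag values more than 1.5 * IQR beyond the quartiles
w   <- toy$BirthWeightGm
q   <- unname(quantile(w, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
is_outlier <- !is.na(w) & (w < q[1] - 1.5 * iqr | w > q[2] + 1.5 * iqr)
outliers <- w[is_outlier]
```

The 1.5 * IQR fences used here are the default whisker rule of `boxplot()`, so `outliers` should correspond to the points drawn beyond the whiskers.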
## [1] "C:/Users/aditi/OneDrive/Documents/Course/Intro to Statistics/Assignment 3"
ncb <- read.csv("NCbirths.csv", stringsAsFactors = FALSE)
# Determine the structure of the data frame
str(ncb)
## 'data.frame': 1450 obs. of 16 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Plural : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : int 1 2 1 1 1 1 2 2 2 2 ...
## $ MomAge : int 32 32 27 27 25 28 25 15 21 27 ...
## $ Age.bracket : chr "Above 30" "Above 30" "Less than 30" "Less than 30" ...
## $ Weeks : int 40 37 39 39 39 43 39 42 39 40 ...
## $ Marital : int 1 1 1 1 1 1 1 2 1 2 ...
## $ RaceMom : int 1 1 1 1 1 1 1 1 1 1 ...
## $ HispMom : chr "N" "N" "N" "N" ...
## $ Gained : int 38 34 12 15 32 32 75 25 28 37 ...
## $ Smoke : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BirthWeightOz: int 111 116 138 136 121 117 143 113 120 124 ...
## $ BirthWeightGm: num 3147 3289 3912 3856 3430 ...
## $ Low : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Premie : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MomRace : chr "white" "white" "white" "white" ...
## [1] "data.frame"
# Convert Age.bracket to a factor
ncb$Age.bracket <- as.factor(ncb$Age.bracket)
# Check the class of the factor variable
class(ncb$Age.bracket)
## [1] "factor"
## [1] "Above 30" "Less than 30"
## [1] 46
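As an illustration of the factor conversion above, the sketch below builds the two age brackets from the first ten `MomAge` values printed by `str(ncb)`. The cut-off rule is an assumption: the report describes the groups as “below 30” and “30 and above”, so a mother aged exactly 30 is placed in "Above 30" here. The names `mom_age` and `age_bracket` are hypothetical.

```r
# First ten MomAge values, as printed by str(ncb) in this report
mom_age <- c(32, 32, 27, 27, 25, 28, 25, 15, 21, 27)

# Assumption: age 30 itself belongs to the "Above 30" group
age_bracket <- factor(
  ifelse(mom_age >= 30, "Above 30", "Less than 30"),
  levels = c("Above 30", "Less than 30")
)

levels(age_bracket)  # level labels, in the order set above
table(age_bracket)   # number of mothers in each bracket
```

Setting `levels` explicitly fixes the group order, which in turn determines the sign of the difference reported by a later `t.test()`.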
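# Load dplyr for %>%, filter(), group_by() and summarise()
library(dplyr)
# Detect outliers with a box plot
ncb$BirthWeightGm %>% boxplot(main = "Box Plot of NCB", ylab = "Birth weight (grams)", col = "grey")
– From the presented data, we compare whether children were born healthier, as measured by birth weight, to mothers in the "Above 30" group or in the "Less than 30" group.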
above30<- ncb %>% filter(Age.bracket=="Above 30")
#Summary Stats for both age groups
tablencb <- ncb %>% group_by(Age.bracket) %>% summarise(Min = min(BirthWeightGm,na.rm = TRUE),
Q1 = quantile(BirthWeightGm,probs = .25,na.rm = TRUE),
Median = median(BirthWeightGm, na.rm = TRUE),
Q3 = quantile(BirthWeightGm,probs = .75,na.rm = TRUE),
Max = max(BirthWeightGm,na.rm = TRUE),
Mean = mean(BirthWeightGm, na.rm = TRUE),
SD = sd(BirthWeightGm, na.rm = TRUE),
n = n(),
Missing = sum(is.na(BirthWeightGm)))
#Stats in detail
tablencb
# Distribution fitting
graph <- hist(above30$BirthWeightGm, xlab = "Child weight for Mother above 30", ylab = "Frequency", main = "Histogram of Empirical dist. for child weight above 30", col = "skyblue", breaks = 15, labels = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 652 3062 3402 3372 3771 4791
# Normal density with the group's mean and SD, evaluated over the observed weight range
x_above <- seq(650, 4800, by = 50)
plot(x_above, dnorm(x_above, mean = 3372, sd = 609.04), type = "p", main = "Normal Distribution child birth for above 30 age group", col = "red", lwd = 2)
lines(x_above, dnorm(x_above, mean = 3372, sd = 609.04), type = "h", col = "blue", lwd = 3)
## Tests to confirm that the data follow a normal distribution
qqnorm(above30$BirthWeightGm, main = "Q-Q plot for above 30", col = "blue")
##
## Shapiro-Wilk normality test
##
## data: above30$BirthWeightGm
## W = 0.96772, p-value = 4.157e-09
####### Testing child weight for mothers below 30
below30 <- ncb %>% filter(Age.bracket=="Less than 30")
graph <- hist(below30$BirthWeightGm, xlab = "Child weight for Mother below 30", ylab = "Frequency", main = "Histogram of Child weight for mother below 30", col = "green", breaks = 15, labels = TRUE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 340.2 2976.8 3316.9 3254.9 3628.8 5131.4
# Normal density with the group's mean and SD, evaluated over the observed weight range
x_below <- seq(340, 5140, by = 50)
plot(x_below, dnorm(x_below, mean = 3255, sd = 642.1), type = "p", main = "Normal Distribution for below 30 mothers", col = "red", lwd = 2)
lines(x_below, dnorm(x_below, mean = 3255, sd = 642.1), type = "h", col = "green", lwd = 3)
## Tests to confirm that the data follow a normal distribution
qqnorm(below30$BirthWeightGm, main = "Q-Q plot for below 30", col = "green")
##
## Shapiro-Wilk normality test
##
## data: below30$BirthWeightGm
## W = 0.92603, p-value < 2.2e-16
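The Shapiro-Wilk outputs above come from a call that the rendered report does not show. A minimal sketch of such a call is given below, assuming base R's `shapiro.test()`; it runs on simulated, deliberately left-skewed values (`sim_weight` is a hypothetical name and the numbers are not the NCbirths data), so its results are not comparable with the outputs above.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Simulated, left-skewed "weights" for illustration; NOT the NCbirths data
sim_weight <- 3400 - rexp(200, rate = 1 / 400)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
sw <- shapiro.test(sim_weight)
sw$statistic  # W statistic, between 0 and 1
sw$p.value    # a small p-value is evidence against normality
```

On the real data the equivalent call would presumably be `shapiro.test(above30$BirthWeightGm)` or `shapiro.test(below30$BirthWeightGm)`. Adding `qqline()` after `qqnorm()` draws a reference line that makes departures in the tails easier to see.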
| Age.bracket | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Above 30 | 652.05 | 3061.80 | 3402.00 | 3770.55 | 4791.15 | 3371.521 | 609.0403 | 506 | 0 |
| Less than 30 | 340.20 | 2976.75 | 3316.95 | 3628.80 | 5131.35 | 3254.934 | 642.0728 | 944 | 0 |
## [1] 116.59
##
## Two Sample t-test
##
## data: ncb$BirthWeightGm by ncb$Age.bracket
## t = 3.3548, df = 1448, p-value = 0.0008147
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 48.4170 184.7561
## sample estimates:
## mean in group Above 30 mean in group Less than 30
## 3371.521 3254.934
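Although the Shapiro-Wilk tests reject normality in both groups, each sample is far larger than 30, so by the central limit theorem the sample means are approximately normally distributed. We also assumed that the two groups have equal variances.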
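The equal-variance assumption can be checked roughly from the summary table alone. The sketch below computes a two-sided F ratio test using the reported standard deviations and group sizes; the variable names are hypothetical, and because the F test is itself sensitive to non-normality it is only an informal check here.

```r
# SDs and group sizes taken from the summary table in this report
s_above <- 609.0403; n_above <- 506
s_below <- 642.0728; n_below <- 944

# Ratio of sample variances (larger variance in the numerator)
f_ratio <- s_below^2 / s_above^2
df1 <- n_below - 1
df2 <- n_above - 1

# Two-sided p-value for H0: the two population variances are equal
p_lower <- pf(f_ratio, df1, df2)
p_value <- 2 * min(p_lower, 1 - p_lower)
```

With the raw data available, `var.test(BirthWeightGm ~ Age.bracket, data = ncb)` performs the same test directly.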
Estimated difference between the means: 116.59 grams
95% CI of the difference: 48.4170 to 184.7561 grams
The mean child weight is not the same in the two age groups.
The p-value is about 0.0008, which is below the 0.05 significance level.
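The reported difference, t statistic, p-value and confidence interval can be reproduced from the summary table with the pooled (equal-variance) two-sample formulas. The sketch below is a worked check under that assumption; the variable names are hypothetical.

```r
# Group means, SDs and sizes taken from the summary table in this report
m_above <- 3371.521; s_above <- 609.0403; n_above <- 506
m_below <- 3254.934; s_below <- 642.0728; n_below <- 944

diff_means <- m_above - m_below
df <- n_above + n_below - 2

# Pooled standard deviation and standard error of the difference
sp <- sqrt(((n_above - 1) * s_above^2 + (n_below - 1) * s_below^2) / df)
se <- sp * sqrt(1 / n_above + 1 / n_below)

# t statistic, two-sided p-value and 95% confidence interval
t_stat  <- diff_means / se
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- diff_means + c(-1, 1) * qt(0.975, df) * se
```

These values should agree with the `t.test()` output (t = 3.3548, df = 1448, CI 48.4170 to 184.7561) up to rounding of the inputs.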
Because the p-value is below the significance level and the confidence interval excludes 0, we reject the null hypothesis of equal mean birth weights. Based on the 1450 observations, mean birth weight is higher when the mother belongs to the “Above 30” age group.
Strengths of Investigation:
Missing data and outliers were detected and were manageable.
The data set was easy to work with and allowed us to run the various tests needed to assess the hypothesis.
The tests require a sample from each of the two populations. For the tests to be valid, the samples must be random and independent of each other. We used a two-sample t-test with equal variances assumed (df = 1448), and on that basis we rejected the null hypothesis of equal means.
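The reported output has df = 1448, which corresponds to the pooled two-sample t-test rather than the Welch version. As a sensitivity check, the Welch statistic can be sketched from the summary table; this is an illustrative calculation with hypothetical variable names, not output from the original analysis.

```r
# Group means, SDs and sizes taken from the summary table in this report
m_above <- 3371.521; s_above <- 609.0403; n_above <- 506
m_below <- 3254.934; s_below <- 642.0728; n_below <- 944

# Per-group variance of the sample mean
v_above <- s_above^2 / n_above
v_below <- s_below^2 / n_below

# Welch t statistic with Welch-Satterthwaite degrees of freedom
t_welch  <- (m_above - m_below) / sqrt(v_above + v_below)
df_welch <- (v_above + v_below)^2 /
  (v_above^2 / (n_above - 1) + v_below^2 / (n_below - 1))
p_welch  <- 2 * pt(-abs(t_welch), df_welch)
```

On the raw data, `t.test(BirthWeightGm ~ Age.bracket, data = ncb)` uses the Welch test by default, while `var.equal = TRUE` requests the pooled version.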