North Carolina births

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Exploratory analysis

Load the nc data set into our workspace.

  1. What are the cases in this data set? How many cases are there in our sample?

Each case represents the birth of one child. There are 1000 cases in the sample.

  1. Make a side-by-side boxplot of habit and weight. What does the plot highlight about the relationship between these two variables?
library(ggplot2)
# Drop only the rows that are missing habit or weight
nc_no_na <- na.omit(nc[, c("habit", "weight")])
ggplot(nc_no_na, aes(x = habit, y = weight)) +
  geom_boxplot()

The median birth weight of babies born to nonsmokers is higher than that of babies born to smokers. The nonsmoker group also has many outliers, which warrant further analysis.

Inference

  1. Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions. You can compute the group size using the same by command above but replacing mean with length.
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126
ggplot(nc, aes(x = weight)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

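The conditions for inference are satisfied. The data are a random sample, and the sample size of 1000 is almost certainly less than 10% of the population of births, so the observations can be treated as independent. The distribution of weight appears approximately normal, and both groups are large (873 nonsmokers and 126 smokers), so the sampling distribution of the difference in means should be nearly normal.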

  1. Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

H0 (null hypothesis): The average weights of babies born to smoking and non-smoking mothers are the same.

HA (alternative hypothesis): The average weights of babies born to smoking and non-smoking mothers are not the same.

We could also run a one-sided test, with the alternative that the average weight of babies born to smoking mothers is less than that of babies born to non-smoking mothers.

  1. Change the type argument to "ci" to construct and record a confidence interval for the difference between the weights of babies born to smoking and non-smoking mothers.
inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862

## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
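
The interval above can be checked by hand from the printed summary statistics. The sketch below assumes a normal critical value (qnorm), which appears consistent with the reported interval; the variable names are illustrative and are not part of the inference() function.

```r
# Summary statistics copied from the inference() output above
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # nonsmoker
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smoker

# Point estimate and standard error of the difference in means
diff_means <- mean_ns - mean_s
se <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)

# 95% interval with a normal critical value (assumption)
z_star <- qnorm(0.975)
ci <- diff_means + c(-1, 1) * z_star * se

print(round(se, 4))
print(round(ci, 4))
```

Because the interval lies entirely above 0, it is consistent with nonsmokers' babies weighing more on average than smokers' babies.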

On your own

inference(y = nc$weeks,  est = "mean", type = "ci", method = "theoretical",conflevel = .95)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )

We are 95% confident that the average length of pregnancy is between 38.1528 and 38.5165 weeks.

inference(y = nc$weeks,  est = "mean", type = "ci", method = "theoretical",conflevel = .90)
## Single mean 
## Summary statistics:

## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
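
Both intervals for weeks can be reproduced from the printed mean, standard deviation, and sample size. This is a minimal sketch that assumes a normal critical value; ci_for is a hypothetical helper written for illustration, not part of inference().

```r
# Summary statistics copied from the inference() output above
x_bar <- 38.3347
s <- 2.9316
n <- 998

# Standard error of the sample mean
se <- s / sqrt(n)

# Hypothetical helper: normal-based interval at a given confidence level
ci_for <- function(conf_level) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  x_bar + c(-1, 1) * z_star * se
}

ci_95 <- ci_for(0.95)
ci_90 <- ci_for(0.90)
print(round(ci_95, 4))
print(round(ci_90, 4))
```

Lowering the confidence level from 95% to 90% shrinks the critical value, so the 90% interval is narrower than the 95% interval.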

H0 (null hypothesis): The average weights of babies born to married and not married mothers are the same.

HA (alternative hypothesis): The average weights of babies born to married and not married mothers are not the same.

inference(y = nc$weight, x = nc$marital, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_married = 386, mean_married = 6.8007, sd_married = 1.6118
## n_not married = 613, mean_not married = 7.2958, sd_not married = 1.4027

## Observed difference between means (married-not married) = -0.4951
## 
## Standard error = 0.0997 
## 95 % Confidence interval = ( -0.6905 , -0.2997 )

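The 95% confidence interval for the difference in mean weight between married and not married mothers, (-0.6905, -0.2997), does not contain 0, so we reject the null hypothesis of equal means at the 5% significance level.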

by(nc$mage, nc$mature, summary)
## nc$mature: mature mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   35.00   37.00   37.18   38.00   50.00 
## -------------------------------------------------------- 
## nc$mature: younger mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   21.00   25.00   25.44   30.00   34.00

The minimum age among mature moms (35) and the maximum age among younger moms (34) give the cutoff: mothers aged 35 or older are classified as mature.
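
The same cutoff can be pulled out programmatically instead of being read off the summary. The sketch below uses a small stand-in data frame, toy, whose ages are just the quartile values printed above rather than the real nc sample; on the actual data the same tapply calls would be run on nc$mage and nc$mature.

```r
# Stand-in data: ages are the values printed by by(nc$mage, nc$mature, summary),
# not the real nc observations
toy <- data.frame(
  mage   = c(35, 37, 38, 50, 13, 21, 25, 30, 34),
  mature = c(rep("mature mom", 4), rep("younger mom", 5))
)

# Youngest and oldest mother in each group
min_age <- tapply(toy$mage, toy$mature, min)
max_age <- tapply(toy$mage, toy$mature, max)

# The cutoff is the youngest age labelled "mature mom"
cutoff <- min_age[["mature mom"]]
oldest_younger <- max_age[["younger mom"]]
print(c(cutoff = cutoff, oldest_younger = oldest_younger))
```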

Is there a relationship between the mother's maturity status (mature or younger) and the length of pregnancy? Here mature is the categorical variable and weeks is the numerical variable.

H0 (null hypothesis): The average length of pregnancy is the same for mature and younger moms.

HA (alternative hypothesis): The average length of pregnancy is not the same for mature and younger moms.

inference(y = nc$weeks, x = nc$mature, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 132, mean_mature mom = 38.0227, sd_mature mom = 3.2184
## n_younger mom = 866, mean_younger mom = 38.3822, sd_younger mom = 2.8844
## Observed difference between means (mature mom-younger mom) = -0.3595
## 
## H0: mu_mature mom - mu_younger mom = 0 
## HA: mu_mature mom - mu_younger mom != 0 
## Standard error = 0.297 
## Test statistic: Z =  -1.211 
## p-value =  0.2258

We fail to reject the null hypothesis, since the p-value (0.2258) is greater than 0.05. The data do not provide convincing evidence that the average length of pregnancy differs between mature and younger mothers. Failing to reject does not show that the two averages are equal.
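
The test statistic and p-value in the output can be recomputed from the group summaries. This is a sketch under the assumption that the test uses a two-sided normal (Z) reference distribution, as the printed "Z" suggests; the variable names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_m <- 132; mean_m <- 38.0227; sd_m <- 3.2184   # mature mom
n_y <- 866; mean_y <- 38.3822; sd_y <- 2.8844   # younger mom

# Standard error of the difference in means
se <- sqrt(sd_m^2 / n_m + sd_y^2 / n_y)

# Z statistic for H0: mu_mature - mu_younger = 0
null_value <- 0
z <- (mean_m - mean_y - null_value) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```

A p-value this large means a difference at least as extreme as the observed one would be fairly common if the two population means were equal.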

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.