Exercise 1

What are the cases in this data set? How many cases are there in our sample?

There are 1000 cases in the sample. Each case is an infant born in North Carolina during 2004, and the data set records information about the infant at birth, the mother, and her behavior during pregnancy. The father’s age is included but is missing for about 17% of cases (171 of 1000). A few other variables have a small percentage of missing values.

summary(nc)
##       fage            mage            mature        weeks             premie   
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00   full term:846  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00   premie   :152  
##  Median :30.00   Median :27                     Median :39.00   NA's     :  2  
##  Mean   :30.26   Mean   :27                     Mean   :38.33                  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00                  
##  Max.   :55.00   Max.   :50                     Max.   :45.00                  
##  NA's   :171                                    NA's   :2                      
##      visits            marital        gained          weight      
##  Min.   : 0.0   married    :386   Min.   : 0.00   Min.   : 1.000  
##  1st Qu.:10.0   not married:613   1st Qu.:20.00   1st Qu.: 6.380  
##  Median :12.0   NA's       :  1   Median :30.00   Median : 7.310  
##  Mean   :12.1                     Mean   :30.33   Mean   : 7.101  
##  3rd Qu.:15.0                     3rd Qu.:38.00   3rd Qu.: 8.060  
##  Max.   :30.0                     Max.   :85.00   Max.   :11.750  
##  NA's   :9                        NA's   :27                      
##  lowbirthweight    gender          habit          whitemom  
##  low    :111    female:503   nonsmoker:873   not white:284  
##  not low:889    male  :497   smoker   :126   white    :714  
##                              NA's     :  1   NA's     :  2  
##                                                             
##                                                             
##                                                             
## 

Exercise 2 Side-by-Side boxplot:

boxplot(nc$weight ~ nc$habit, col = c('seashell1', 'seashell3'),
        ylab = "infant weight (lbs)", xlab = NULL,
        main = "infant birth weight by mother smoking habit")

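# Mean birth weight per smoking habit (avoid naming the result `by`, which shadows the function)
means_by_habit <- by(nc$weight, nc$habit, mean)

Mean infant birth weight, non-smoking mothers = 7.1442726 lbs
Mean infant birth weight, smoking mothers = 6.8287302 lbs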

Normal Q-Q plots of infant birth weight for non-smoking and smoking mothers:

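library(dplyr)  # provides %>% and filter()

# Two panels side by side; the top margin leaves room for the panel titles
par(mfrow = c(1, 2), mar = c(4, 4, 2, 0.5))

# Split the sample by smoking habit (rows with missing habit are dropped)
nonsmoker <- nc %>%
  filter(habit == 'nonsmoker')
smoker <- nc %>%
  filter(habit == 'smoker')

qqnorm(nonsmoker$weight, main = 'birth weight - nonsmoker')
qqline(nonsmoker$weight)

qqnorm(smoker$weight, main = 'birth weight - smoker')
qqline(smoker$weight)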

Exercise 3

Check if the conditions necessary for inference are satisfied. Note that you will need to obtain sample sizes to check the conditions.

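1. The cases are independent, assuming the 1000 births are a random sample and represent fewer than 10% of all births in North Carolina during 2004.
2. Both groups are large: 873 non-smokers and 126 smokers, each well above 30, so moderate skew in birth weight does not invalidate the normal approximation for the difference in means.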
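
The group sizes used in condition 2 can be read from a frequency table. The following is a minimal sketch that uses a small made-up data frame, toy (hypothetical, standing in for nc), so that it runs on its own. It also shows that rows with a missing habit are left out of the counts by default.

```r
# Hypothetical stand-in for nc: six births, one with habit missing
toy <- data.frame(
  weight = c(7.1, 6.8, 7.5, 6.2, 8.0, 7.3),
  habit  = factor(c("nonsmoker", "smoker", "nonsmoker", NA, "nonsmoker", "smoker"))
)

# Group sizes; table() ignores NA unless useNA is set
n_by_habit <- table(toy$habit)
print(n_by_habit)

# Show the missing values as their own category
print(table(toy$habit, useNA = "ifany"))

# Large-sample check: is every group larger than 30? (not for this toy data)
large_enough <- all(n_by_habit > 30)
print(large_enough)
```

Run on nc itself, table(nc$habit) should reproduce the 873 and 126 quoted above, in line with the summary(nc) output.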

Exercise 4

Write the hypotheses for testing if the average weights of babies born to smoking and non-smoking mothers are different.

\(H_0\) : Average weight of infants born to non-smoking mothers and smoking mothers is the same.
\(H_A\) : Average weight of infants born to non-smoking mothers and smoking mothers is different.
\(H_0\) : \(\mu_{nonsmoker} - \mu_{smoker} = 0\)
\(H_A\) : \(\mu_{nonsmoker} - \mu_{smoker} \ne 0\)

Exercise 5 Hypothesis test and confidence interval

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ht", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184

The p-value of 0.0184 is less than 0.05, so we reject the null hypothesis. The data provide convincing evidence that the average birth weight of children born in North Carolina during 2004 to non-smoking mothers differs from the average birth weight of children born to smoking mothers.
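
As a cross-check, the standard error, test statistic, and p-value can be recomputed by hand from the summary statistics printed above. This is a sketch that assumes the normal (Z) approximation reported in the output. The variable names are illustrative, and the inputs are the rounded values from the output, so the last digit may differ slightly.

```r
# Summary statistics copied from the inference() output above
n_ns <- 873; mean_ns <- 7.1443; sd_ns <- 1.5187   # non-smokers
n_s  <- 126; mean_s  <- 6.8287; sd_s  <- 1.3862   # smokers

# Standard error of the difference between two independent sample means
se <- sqrt(sd_ns^2 / n_ns + sd_s^2 / n_s)

# Z statistic under H0: mu_nonsmoker - mu_smoker = 0
z <- (mean_ns - mean_s - 0) / se

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(round(c(se = se, z = z, p_value = p_value), 4))
```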

inference(y = nc$weight, x = nc$habit, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## 
## Standard error = 0.1338 
## 95 % Confidence interval = ( 0.0534 , 0.5777 )
title(main = "Infant birth weight", line = 3)

We are 95% confident that the average birth weight of children born in North Carolina during 2004 to non-smoking mothers is between 0.0534 and 0.5777 lbs greater than the average birth weight of children born to smoking mothers.
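
The interval is the observed difference plus or minus a critical value times the standard error. A short sketch of that arithmetic follows, assuming the normal critical value and using the rounded figures from the output, so the endpoints agree only to about three decimals.

```r
# Values copied from the inference() output above
diff_means <- 0.3155   # observed difference (nonsmoker - smoker), in lbs
se         <- 0.1338   # standard error of the difference

# Critical value for a 95% interval under the normal approximation
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error
ci <- diff_means + c(-1, 1) * z_star * se
print(round(ci, 4))
```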

1. Calculate a 95% confidence interval for the average length of pregnancies (weeks) and interpret it in context.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Single mean 
## Summary statistics:
## mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 95 % Confidence interval = ( 38.1528 , 38.5165 )
title(main = "Average length of pregnancies (weeks)")

We are 95% confident that the average length of pregnancies in North Carolina resulting in a birth during 2004 was between 38.153 and 38.517 weeks. On average, a pregnancy in NC in 2004 would therefore be considered an early term pregnancy.

2. Calculate a new confidence interval for the same parameter at the 90% confidence level.

inference(y = nc$weeks, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical", conflevel = 0.90, eda_plot = FALSE)
## Single mean 
## Summary statistics: mean = 38.3347 ;  sd = 2.9316 ;  n = 998 
## Standard error = 0.0928 
## 90 % Confidence interval = ( 38.182 , 38.4873 )
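
To see why the 90% interval is narrower than the 95% one, both can be rebuilt from the summary statistics with a small helper. The function mean_ci below is a hypothetical name introduced for illustration, not part of the inference() code. It assumes the normal approximation, which is reasonable with n = 998.

```r
# Summary statistics for weeks, copied from the output above
x_bar <- 38.3347; s <- 2.9316; n <- 998

# Normal-approximation confidence interval for a single mean
mean_ci <- function(x_bar, s, n, conflevel = 0.95) {
  se     <- s / sqrt(n)                       # standard error of the mean
  z_star <- qnorm(1 - (1 - conflevel) / 2)    # two-sided critical value
  x_bar + c(-1, 1) * z_star * se
}

ci95 <- mean_ci(x_bar, s, n, conflevel = 0.95)
ci90 <- mean_ci(x_bar, s, n, conflevel = 0.90)
print(round(rbind(ci95, ci90), 4))

# Lower confidence means a smaller critical value and a narrower interval
narrower <- diff(ci90) < diff(ci95)
print(narrower)
```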

3. Conduct a hypothesis test evaluating whether the average weight gained by younger mothers is different than the average weight gained by mature mothers.

inference(y = nc$gained, x = nc$mature, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_mature mom = 129, mean_mature mom = 28.7907, sd_mature mom = 13.4824
## n_younger mom = 844, mean_younger mom = 30.5604, sd_younger mom = 14.3469
## Observed difference between means (mature mom-younger mom) = -1.7697
## 
## Standard error = 1.2857 
## 95 % Confidence interval = ( -4.2896 , 0.7502 )
title(main = "Mother weight gain", line = 3)

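The hypotheses are \(H_0\) : \(\mu_{mature} - \mu_{younger} = 0\) and \(H_A\) : \(\mu_{mature} - \mu_{younger} \ne 0\). The 95% confidence interval (-4.2896, 0.7502) contains 0, which is equivalent to failing to reject \(H_0\) in a two-sided test at the 5% significance level. There is not sufficient evidence that the average weight gain by mature mothers is different from the average weight gain by younger mothers.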
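
The link between the interval and the test can be checked numerically. The sketch below recomputes the Z statistic and two-sided p-value from the summary statistics above (normal approximation assumed, variable names illustrative) and confirms that the interval containing 0 and a p-value above 0.05 give the same decision.

```r
# Summary statistics copied from the inference() output above
n_m <- 129; mean_m <- 28.7907; sd_m <- 13.4824   # mature mom
n_y <- 844; mean_y <- 30.5604; sd_y <- 14.3469   # younger mom

# Standard error and test statistic for H0: mu_mature - mu_younger = 0
se <- sqrt(sd_m^2 / n_m + sd_y^2 / n_y)
z  <- (mean_m - mean_y) / se
p_value <- 2 * pnorm(-abs(z))

# 95% confidence interval for the same difference
ci <- (mean_m - mean_y) + c(-1, 1) * qnorm(0.975) * se
contains_zero <- ci[1] < 0 && ci[2] > 0

print(round(c(se = se, z = z, p_value = p_value), 4))

# Both views lead to the same decision at the 5% level
same_decision <- contains_zero == (p_value > 0.05)
print(same_decision)
```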

4. Determine the age cutoff for younger and mature mothers. Use a method of your choice, and explain how your method works.

The quick way is to sort nc by mage and see where younger ends and mature begins: a mother aged \(\ge\) 35 is a mature mom. Using code:

# Oldest mother labelled "younger mom"
younger_moms <- filter(nc, mature == "younger mom")
youngmax <- max(younger_moms$mage)

# Youngest mother labelled "mature mom"
mature_moms <- filter(nc, mature == "mature mom")
maturemin <- min(mature_moms$mage)

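The oldest younger mom is 34 (youngmax) and the youngest mature mom is 35 (maturemin). The two groups do not overlap in age, so a younger mom is <= 34 and a mature mom is >= 35.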
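
The same idea can be written without intermediate data frames by taking the maximum and minimum age within each group. This is a sketch on a small made-up data frame, toy (hypothetical, standing in for nc, with ages chosen to mirror the 34/35 result), so that it runs on its own.

```r
# Hypothetical stand-in for nc with the two columns that matter here
toy <- data.frame(
  mage   = c(19, 27, 34, 35, 41, 22),
  mature = factor(c("younger mom", "younger mom", "younger mom",
                    "mature mom", "mature mom", "younger mom"))
)

# Oldest and youngest mother within each group
max_age <- tapply(toy$mage, toy$mature, max)
min_age <- tapply(toy$mage, toy$mature, min)

# The cutoff sits between the oldest younger mom and the youngest mature mom
oldest_younger  <- as.numeric(max_age["younger mom"])
youngest_mature <- as.numeric(min_age["mature mom"])
print(c(oldest_younger = oldest_younger, youngest_mature = youngest_mature))
```

If the two values are consecutive integers, as here, the cutoff is unambiguous. A gap or an overlap would mean the grouping was not defined by age alone.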

5. Is there a difference in birth weight between boys and girls?

inference(y = nc$weight, x = nc$gender, est = "mean", type = "ci", null = 0, 
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_female = 503, mean_female = 6.9029, sd_female = 1.4759
## n_male = 497, mean_male = 7.3015, sd_male = 1.5168
## Observed difference between means (female-male) = -0.3986
## 
## Standard error = 0.0947 
## 95 % Confidence interval = ( -0.5841 , -0.2131 )

The entire 95% confidence interval for \(\mu_{female} - \mu_{male}\) lies below 0, so there is a difference: we are 95% confident that the average birth weight of girls born in North Carolina during 2004 was between 0.213 and 0.584 lbs less than the average birth weight of boys born there that year.