Purpose

In this project, students will demonstrate their understanding of inference for numerical data using the t-distribution. Unless otherwise stated, assume a significance level of 0.05.


Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis: run the str() function, view the data frame, and read its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)
## Warning: package 'openintro' was built under R version 3.3.3
# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset

mn_gained <- mean(ncbirths$gained, na.rm = TRUE)

sd_gained <- sd(ncbirths$gained, na.rm = TRUE)

table(is.na(ncbirths$gained))
## 
## FALSE  TRUE 
##   973    27
#From the table, we see there are 973 non-missing data points in the vector for weight gained

sz_gained <- sum(!is.na(ncbirths$gained))  # 973, computed rather than typed in
# Calculate t-critical value for 90% confidence

t90_gained <- qt(0.05, df = (sz_gained-1), lower.tail = TRUE)  
round(t90_gained, 4)  
## [1] -1.6464

The lower-tail critical t-value for a 90% confidence interval with 972 degrees of freedom is -1.6464. Its absolute value, 1.6464, is the multiplier used in the margin of error.

# Calculate margin of error
ME90_gained <- abs(t90_gained)*sd_gained/sqrt(sz_gained)  
round(ME90_gained, 4)
## [1] 0.7517

The margin of error is 0.7517.

# Boundaries of confidence interval
#The lower bound is
mn_gained - ME90_gained
## [1] 29.57411
#The upper bound is
mn_gained + ME90_gained
## [1] 31.07748

The 90% confidence interval is (29.57, 31.08) lbs. In context: we are 90% confident that the average weight gained by North Carolina mothers is between 29.57 and 31.08 lbs.
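
The steps above (sample statistics, critical value, margin of error, bounds) can be wrapped into one small function. This is only a sketch: `t_ci` is a hypothetical helper name, and the vector is simulated rather than taken from ncbirths. The last line cross-checks the manual interval against the one reported by t.test().

```r
# Hypothetical helper: t confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf = 0.90) {
  x <- x[!is.na(x)]
  n <- length(x)
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)   # upper-tail critical value (positive)
  me <- t_crit * sd(x) / sqrt(n)                 # margin of error
  c(lower = mean(x) - me, upper = mean(x) + me)
}

set.seed(1)                                      # simulated data, for illustration only
x <- c(rnorm(50, mean = 30, sd = 14), NA, NA)

manual  <- t_ci(x, conf = 0.90)
builtin <- t.test(x, conf.level = 0.90)$conf.int # t.test() drops NAs as well
all.equal(unname(manual), as.numeric(builtin))
```

Asking qt() for the upper tail directly (0.95 for a 90% interval) returns a positive multiplier, so abs() is not needed.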

Question 2 - Single Sample t-confidence interval

  1. Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.
  2. How does that confidence interval compare to the one in Question #1?
#To find the 95% confidence interval, first find the t-score that corresponds to 95%.
t95_gained <- qt(0.025, df = (sz_gained - 1))

#Then calculate the new bounds.
#Lower:
mn_gained - abs(t95_gained)*sd_gained/sqrt(sz_gained)
## [1] 29.42985
#Upper:
mn_gained + abs(t95_gained)*sd_gained/sqrt(sz_gained)
## [1] 31.22174

The 95% confidence interval for the average weight gained by North Carolina mothers is (29.43, 31.22) lbs.

The 95% confidence interval is wider than the 90% confidence interval (about 1.79 lbs. across versus about 1.50 lbs.). To be more confident that the interval captures the true mean weight gain of all North Carolina mothers, we have to allow a larger margin of error.
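
The widening comes entirely from the critical value, since the standard error is the same in both intervals. A short sketch using the numbers reported in Question 1 (972 degrees of freedom and a 90% margin of error of 0.7517); the variable names are mine:

```r
dof  <- 972                       # sample size 973 minus 1, from Question 1
me90 <- 0.7517                    # 90% margin of error reported in Question 1

t90 <- qt(0.95,  df = dof)        # critical value for 90% confidence
t95 <- qt(0.975, df = dof)        # critical value for 95% confidence

me95 <- me90 * t95 / t90          # same standard error, larger multiplier
round(c(t90 = t90, t95 = t95, me95 = me95), 4)
```

The resulting 95% margin of error should agree, up to rounding, with half the width of the interval (29.43, 31.22).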

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

  1. Write hypotheses
    \(H_0: \mu_{NC} = 7.7 \text{ lbs}\)
    \(H_A: \mu_{NC} \ne 7.7 \text{ lbs}\)

  2. Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mn_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm = TRUE)

table(is.na(ncbirths$weight))
## 
## FALSE 
##  1000
#There are no missing data from the baby weight column, so the sample size is 1000.
sz_weight <- 1000
# Test statistic
t95_weight <- (mn_weight - 7.7)/(sd_weight/sqrt(sz_weight))
# Two-sided p-value: probability of a test statistic at least this extreme in either tail
2 * pt(abs(t95_weight), df = sz_weight-1, lower.tail = FALSE)

The p-value is the probability, when the true population mean is 7.7 lbs., of getting a sample mean at least as far from 7.7 lbs. as the observed 7.1 lbs. Because the alternative hypothesis is two-sided, the tail area is doubled. The p-value is far smaller than the significance level of 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.
  3. Conclusion
    The data suggests that the mean birth weight of babies born in North Carolina differs from the mean birth weight of European babies, 7.7 lbs.
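
As a check on the hand calculation, the same test can be run with t.test() and its mu argument. The sketch below uses simulated weights (not the ncbirths column) and confirms that the manual two-sided p-value matches the built-in one.

```r
set.seed(2)                                   # simulated weights, for illustration only
w   <- rnorm(40, mean = 7.1, sd = 1.5)
mu0 <- 7.7                                    # null value from the question

# Manual test statistic and two-sided p-value
t_stat <- (mean(w) - mu0) / (sd(w) / sqrt(length(w)))
p_two  <- 2 * pt(abs(t_stat), df = length(w) - 1, lower.tail = FALSE)

res <- t.test(w, mu = mu0)                    # two-sided by default
all.equal(p_two, res$p.value)
```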

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

  1. Write hypotheses
    \(H_0: \mu_{diff} = 0\)
    \(H_A: \mu_{diff} \ne 0\)
    two-tailed, where \(\mu_{diff}\) is the population mean of the paired differences (father's age minus mother's age)

  2. Test by confidence interval or p-value and decision

# Each row is one birth, so fage and mage are paired: request a paired test
t.test(ncbirths$fage, ncbirths$mage, paired = TRUE)

With paired = TRUE, t.test() tests the mean of the father-minus-mother age differences, using only the births where both ages are recorded. Without that argument it runs a Welch two-sample test, which treats the two columns as independent samples. The p-value is less than \(\alpha\) = 0.05, therefore we reject \(H_0\).

  3. Conclusion
    The data suggests that the mean age of mothers is different from the mean age of fathers.
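
To see what pairing changes, the sketch below compares three calls on simulated couples (the ages are made up, not taken from ncbirths). A paired test is the same as a one-sample test on the differences, while the Welch test ignores the link between the two columns.

```r
set.seed(3)                                    # simulated couples, for illustration only
mage <- round(rnorm(60, mean = 27, sd = 6))
fage <- mage + round(rnorm(60, mean = 3, sd = 4))
fage[c(5, 17)] <- NA                           # mimic missing father ages

paired   <- t.test(fage, mage, paired = TRUE)  # uses complete pairs only
one_samp <- t.test(fage - mage, mu = 0)        # same test on the differences
welch    <- t.test(fage, mage)                 # treats the columns as independent

c(paired = paired$p.value, differences = one_samp$p.value, welch = welch$p.value)
```

The first two p-values are identical, and the paired degrees of freedom equal the number of complete pairs minus one.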

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

  1. Write hypotheses
    \(H_0: \mu_{weeksS} - \mu_{weeksNS} = 0\)
    \(H_A: \mu_{weeksS} - \mu_{weeksNS} \ne 0\)
    two-tailed

  2. Test by confidence interval or p-value and decision

#First create two subsets. One for smokers, a second for nonsmokers  
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

#Then use the t.test to find the p-value
t.test(smokers$weeks, nonsmokers$weeks)
## 
##  Welch Two Sample t-test
## 
## data:  smokers$weeks and nonsmokers$weeks
## t = 0.519, df = 182.63, p-value = 0.6044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3519903  0.6032646
## sample estimates:
## mean of x mean of y 
##  38.44444  38.31881

The p-value is larger than a significance level of 0.05 and the 95% confidence interval includes zero. Therefore, we fail to reject the null hypothesis.

  3. Conclusion
    The data does not suggest that there is a difference in length of pregnancy between smoking mothers and nonsmoking mothers.
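
The two subsets are not strictly necessary: t.test() also accepts a formula of the form numeric ~ group. A sketch on a simulated data frame (the `births` object and its values are made up for illustration) showing that both approaches give the same two-sided p-value:

```r
set.seed(4)                                     # simulated data, for illustration only
births <- data.frame(
  habit = rep(c("nonsmoker", "smoker"), times = c(80, 20)),
  weeks = round(rnorm(100, mean = 38.4, sd = 2.5))
)

# Approach used above: split into two vectors first
by_subset  <- t.test(births$weeks[births$habit == "smoker"],
                     births$weeks[births$habit == "nonsmoker"])

# Formula interface: groups are taken in alphabetical (factor level) order
by_formula <- t.test(weeks ~ habit, data = births)

all.equal(by_subset$p.value, by_formula$p.value)
```

Because the formula interface orders the groups alphabetically (nonsmoker minus smoker here), the sign of the t statistic and of the confidence interval is flipped relative to the subset version. That matters for one-sided tests.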

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

  1. Write hypotheses
    \(H_0: \mu_{gainedYM} - \mu_{gainedMM} = 0\)
    \(H_A: \mu_{gainedYM} - \mu_{gainedMM} > 0\)
    one-tailed
  2. Test by confidence interval or p-value and decision
#First create two subsets. One for younger moms, a second for mature moms  
younger <- subset(ncbirths, ncbirths$mature == "younger mom")
mature <- subset(ncbirths, ncbirths$mature == "mature mom")

#Then use the t.test to find the p-value
t.test(younger$gained, mature$gained, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  younger$gained and mature$gained
## t = 1.3765, df = 175.34, p-value = 0.08521
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.3562741        Inf
## sample estimates:
## mean of x mean of y 
##  30.56043  28.79070

The p-value is greater than \(\alpha\) = 0.05, so we fail to reject the null hypothesis.

  3. Conclusion
    The data does not provide sufficient evidence that the average weight gained by younger mothers is more than the average weight gained by mature mothers.
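
The one-sided p-value can be reproduced from the test statistic and degrees of freedom printed in the output above. The sketch also shows the two-sided value for comparison; the variable names are mine.

```r
t_stat <- 1.3765      # from the t.test() output above
dof    <- 175.34      # Welch degrees of freedom from the same output

p_greater   <- pt(t_stat, df = dof, lower.tail = FALSE)           # H_A: difference > 0
p_two_sided <- 2 * pt(abs(t_stat), df = dof, lower.tail = FALSE)  # H_A: difference != 0

round(c(greater = p_greater, two_sided = p_two_sided), 4)
```

With a positive test statistic, the "greater" p-value is half the two-sided one, and here even the one-sided value stays above 0.05.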

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

#Let's create two subsets, one for "low" weight babies and a second for "not low" weight babies.  
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")

notlow <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

#Generate a summary of each subset
fivenum(low$weight)
## [1] 1.000 3.095 4.560 5.160 5.500
fivenum(notlow$weight)
## [1]  5.56  6.75  7.44  8.13 11.75

The heaviest “low” baby weighed 5.5 lbs, while the lightest “not low” baby weighed 5.56 lbs. Based on those numbers, the cutoff lies between 5.5 and 5.56 lbs. No baby in the dataset falls inside that gap, so the data alone cannot narrow it further; the value that makes the most sense to me is 5.5 lbs. Babies weighing 5.5 lbs or less are classified as “low”, while babies weighing more than 5.5 lbs are classified as “not low”.
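
The same comparison can be made without creating separate subsets by splitting the weights by category and taking the range of each group. A sketch on a hypothetical miniature data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical miniature dataset, for illustration only
toy <- data.frame(
  weight         = c(4.2, 5.5, 5.1, 5.56, 7.4, 6.8),
  lowbirthweight = c("low", "low", "low", "not low", "not low", "not low")
)

# Row 1 = minimum, row 2 = maximum weight within each category
ranges <- sapply(split(toy$weight, toy$lowbirthweight), range)
ranges

# The cutoff must lie between these two values
c(heaviest_low     = ranges[2, "low"],
  lightest_not_low = ranges[1, "not low"])
```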

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

  1. Question
    Are boys’ birth weights higher than girls’ birth weights at the 0.01 significance level?

  2. Write hypotheses
    \(H_0: \mu_{weightB} - \mu_{weightG} = 0\)
    \(H_A: \mu_{weightB} - \mu_{weightG} > 0\)
    one-tailed

  3. Test by confidence interval or p-value and decision

#First create two subsets. One for boys, another for girls  
boys <- subset(ncbirths, ncbirths$gender == "male")
girls <- subset(ncbirths, ncbirths$gender == "female")  

#Then use the t.test to find the p-value
t.test(boys$weight, girls$weight, alternative = "greater", conf.level = 0.99)
## 
##  Welch Two Sample t-test
## 
## data:  boys$weight and girls$weight
## t = 4.2113, df = 996.45, p-value = 1.384e-05
## alternative hypothesis: true difference in means is greater than 0
## 99 percent confidence interval:
##  0.1780681       Inf
## sample estimates:
## mean of x mean of y 
##  7.301509  6.902883

The p-value is less than \(\alpha\) = 0.01, so we reject the null hypothesis.

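  4. Conclusion
    At the 0.01 significance level, the data provides strong evidence that the mean birth weight of boys is greater than the mean birth weight of girls (sample means of 7.3 lbs. and 6.9 lbs.). The one-sided 99% confidence bound puts the difference in population means above 0.178 lbs.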
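
The one-sided lower bound in the output can be recovered from the printed sample means, test statistic, and degrees of freedom. A sketch under the assumption that the standard error is backed out as difference / t; the variable names are mine.

```r
diff_means <- 7.301509 - 6.902883   # sample means from the output above
t_stat     <- 4.2113                # test statistic from the output
dof        <- 996.45                # Welch degrees of freedom from the output

se    <- diff_means / t_stat        # standard error implied by t = difference / SE
lower <- diff_means - qt(0.99, df = dof) * se   # one-sided 99% lower bound
round(lower, 3)
```

Because the alternative is "greater", all of the 1% error is placed in one tail, which is why the multiplier is qt(0.99, ...) and the upper bound is Inf.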