Project #5 - Inference on Numerical Data

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, sample size, and standard error of dataset
wg_mean <- mean(ncbirths$gained, na.rm = TRUE) # Mean
wg_sd <- sd(ncbirths$gained, na.rm = TRUE) # Standard Deviation
wg_ss <- sum(!is.na(ncbirths$gained)) # Sample Size (non-missing values only)
wg_se <- wg_sd/sqrt(wg_ss) # Standard Error

# Calculate t-critical value for 90% confidence
wg_tCrit90 <- qt(p=0.95, df = wg_ss - 1)

# Calculate margin of error
wg_errorMargin90 <- wg_tCrit90 * wg_se

# Boundaries of confidence interval
paste("We are 90% confident that the true average weight gained by North Carolina mothers is between", round(wg_mean - wg_errorMargin90, 2), "and", round(wg_mean + wg_errorMargin90, 2), "pounds.")

## [1] "We are 90% confident that the true average weight gained by North Carolina mothers is between 29.57 and 31.08 pounds."
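As a cross-check, the same interval can be sketched as a small helper and compared with R's built-in t.test(). The helper name t_ci and the sample vector below are illustrative assumptions, not part of the assignment; with the real data, x would be ncbirths$gained.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_ci <- function(x, level = 0.90) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  # Two-sided interval: put (1 - level) / 2 in each tail
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_crit * se
}

# Made-up sample for illustration only
x <- c(30, 25, 41, 18, NA, 33, 27, 36, 22, 29)

manual <- t_ci(x, level = 0.90)
builtin <- as.numeric(t.test(x, conf.level = 0.90)$conf.int)

print(manual)
print(builtin)
```

Both lines should print the same pair of bounds, since t.test() also drops missing values and uses n - 1 degrees of freedom.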

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Calculate t-critical value for 95% confidence
wg_tCrit95 <- qt(p=0.975, df = wg_ss - 1)

# Calculate margin of error
wg_errorMargin95 <- wg_tCrit95 * wg_se

# Boundaries of confidence interval
paste("We are 95% confident that the true average weight gained by North Carolina mothers is between", round(wg_mean - wg_errorMargin95, 2), "and", round(wg_mean + wg_errorMargin95, 2), "pounds.")

## [1] "We are 95% confident that the true average weight gained by North Carolina mothers is between 29.43 and 31.22 pounds."

How does that confidence interval compare to the one in Question #1?
It is wider than the interval in Question 1. A higher confidence level requires a larger t-critical value, which increases the margin of error; a wider interval is more likely to contain the true mean.
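The effect of the confidence level on the width can be seen from the critical values alone. This is a minimal sketch; the degrees of freedom used here are an arbitrary illustrative choice, not the value from the ncbirths sample.

```r
# Illustrative degrees of freedom (an assumption, not taken from ncbirths)
deg_free <- 100
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided critical value for each confidence level
t_crit <- qt(1 - (1 - conf_levels) / 2, df = deg_free)
names(t_crit) <- c("90%", "95%", "99%")
print(round(t_crit, 3))

# Margin of error = t_crit * SE, so for a fixed SE the interval widths
# scale exactly like the critical values
width_ratio <- t_crit[["95%"]] / t_crit[["90%"]]
print(width_ratio)
```

Because the standard error does not depend on the confidence level, the ratio of the two critical values is also the ratio of the two interval widths.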

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\[H_0: \mu = 7.7\text{ lbs}\] \[H_A: \mu \neq 7.7\text{ lbs}\] This is a two-tailed test.
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, size, and standard error)
weight_mean <- mean(ncbirths$weight, na.rm = TRUE) # Mean
weight_sd <- sd(ncbirths$weight, na.rm = TRUE) # Standard Deviation
weight_ss <- sum(!is.na(ncbirths$weight)) # Sample Size
weight_se <- weight_sd/sqrt(weight_ss) # Standard Error

# Test statistic
t_stat <- (weight_mean - 7.7) / weight_se

# Two-sided p-value: probability of a sample mean at least this far from 7.7lbs if the null hypothesis is true
p_value <- pt(abs(t_stat), df=weight_ss-1, lower.tail = FALSE)*2
cat("If the true mean is 7.7lbs, the probability of a sample mean at least this far from it is ", p_value*100, "%", sep="")

Conclusion
The sample mean lies many standard errors below 7.7lbs, so the p-value is far below 5% and we reject the null hypothesis.
There is sufficient evidence that the average birthweight in North Carolina is different from that in Europe.
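The test statistic and two-sided p-value can also be checked against t.test(). The sketch below uses a hypothetical helper, one_sample_t, and made-up weights; with the real data, the vector would be ncbirths$weight.

```r
# Hypothetical helper: two-sided one-sample t-test against a null mean mu0
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]
  n <- length(x)
  # The standard error must come from the same variable being tested
  t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  # abs() with the upper tail gives the right p-value for either sign of t
  p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  c(t = t_stat, p = p_value)
}

# Made-up birth weights (lbs) for illustration only
w <- c(7.2, 6.8, 8.1, 7.5, 6.1, 7.9, 5.4, 7.0, 8.3, 6.6)

manual <- one_sample_t(w, mu0 = 7.7)
builtin <- t.test(w, mu = 7.7)

print(manual)
print(c(t = unname(builtin$statistic), p = builtin$p.value))
```

Taking abs(t_stat) and the upper tail avoids relying on the sign of the statistic, which matters when the sample mean falls above the null value.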

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
\[H_0: \mu_d = 0\] \[H_A: \mu_d \neq 0\]
Test by confidence interval or p-value and decision

# Keep only births where both parents' ages are recorded, so the ages stay paired
paired <- subset(ncbirths, !is.na(fage) & !is.na(mage))

# Difference in age (father - mother) within each pair
age_diff <- paired$fage - paired$mage

# Sample statistics of the differences
diff_mean <- mean(age_diff) # Point estimate
diff_sd <- sd(age_diff) # Standard Deviation
diff_ss <- length(age_diff) # Number of pairs
diff_se <- diff_sd/sqrt(diff_ss) # Standard Error

# Find the test statistic from our point estimate and standard error
t_stat <- diff_mean / diff_se

# Calculate the two-sided p-value with (number of pairs - 1) degrees of freedom
pt(abs(t_stat), df=diff_ss-1, lower.tail = FALSE)*2

Conclusion
The p-value is far below 0.05, so we reject the null hypothesis in favor of the alternative.
There is sufficient evidence that the average age of North Carolina fathers is different from the average age of mothers.
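A paired test works on the within-pair differences, which reduces it to a one-sample t-test of those differences against 0. The sketch below illustrates this with made-up ages (the vectors are assumptions for illustration) and compares the result with t.test(..., paired = TRUE).

```r
# Made-up parent ages; NA marks a birth with no recorded father's age
fage <- c(31, 28, NA, 40, 35, 26, 33, 29)
mage <- c(29, 27, 24, 36, 36, 22, 30, 29)

# Keep complete pairs only, so each difference comes from one birth
keep <- !is.na(fage) & !is.na(mage)
d <- fage[keep] - mage[keep]
n <- length(d)

# One-sample t-test on the differences against 0
t_stat <- mean(d) / (sd(d) / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

builtin <- t.test(fage, mage, paired = TRUE)

print(c(t = t_stat, p = p_value))
print(c(t = unname(builtin$statistic), p = builtin$p.value))
```

Filtering each column separately would leave vectors of different lengths and break the pairing, which is why the filter is applied to both columns at once.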

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\[H_0: \mu_{\text{ns}} = \mu_{\text{s}}\] \[H_A: \mu_{\text{ns}} \neq \mu_{\text{s}}\]
Test by confidence interval or p-value and decision

smokersBL <- subset(ncbirths, habit == "smoker" & !is.na(weeks))$weeks
nonsmokersBL <- subset(ncbirths, habit == "nonsmoker" & !is.na(weeks))$weeks
# Get the means
smokersBL_mean <- mean(smokersBL)
nonsmokersBL_mean <- mean(nonsmokersBL)
# Extract the sample sizes
smokersBL_ss <- length(smokersBL)
nonsmokersBL_ss <- length(nonsmokersBL)
# Extract standard deviations
smokersBL_sd <- sd(smokersBL)
nonsmokersBL_sd <- sd(nonsmokersBL)

# Calculate the total standard error (each group's variance divided by its sample size)
se <- sqrt((nonsmokersBL_sd^2/nonsmokersBL_ss)+(smokersBL_sd^2/smokersBL_ss))

# Calculate our point estimate
pe <- (smokersBL_mean - nonsmokersBL_mean)

# Find the test statistic from our point estimate and standard error
t <- pe / se

# Calculate the two-sided p-value using the smaller of the two sample sizes for the degrees of freedom
pt(abs(t), df=min(c(smokersBL_ss, nonsmokersBL_ss))-1, lower.tail = FALSE)*2

Conclusion
The p-value is well above 0.05, so we fail to reject our null hypothesis.
There is not sufficient evidence that the average length of pregnancy differs between smokers and non-smokers.
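The min(n1, n2) - 1 rule for degrees of freedom is a conservative shortcut. R's t.test() uses the Welch approximation by default, which gives the same test statistic but usually more degrees of freedom. The sketch below compares the two on made-up pregnancy lengths; the vectors are assumptions for illustration only.

```r
# Made-up pregnancy lengths in weeks, for illustration only
smoker <- c(38, 39, 37, 40, 36, 39, 38)
nonsmoker <- c(39, 40, 38, 41, 37, 39, 40, 38, 39, 42)

# Standard error of the difference: each group's variance over its own size
se <- sqrt(var(smoker) / length(smoker) + var(nonsmoker) / length(nonsmoker))
t_stat <- (mean(smoker) - mean(nonsmoker)) / se

# Conservative degrees of freedom: smaller sample size minus 1
df_conservative <- min(length(smoker), length(nonsmoker)) - 1
p_conservative <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)

# Welch two-sample test (t.test default, var.equal = FALSE)
welch <- t.test(smoker, nonsmoker)

print(c(t = t_stat, df = df_conservative, p = p_conservative))
print(c(t = unname(welch$statistic), df = unname(welch$parameter), p = welch$p.value))
```

The statistics agree, while the conservative rule yields a p-value at least as large as Welch's, so it never rejects when Welch would not.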

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
\[H_0: \mu_y - \mu_m = 0\] \[H_A: \mu_y - \mu_m > 0\] One-tailed test
Test by confidence interval or p-value and decision

# Extract our samples and the relevant data about them
youngerGained <- subset(ncbirths, mature == "younger mom" & !is.na(gained))$gained
youngerGained_mean <- mean(youngerGained)
youngerGained_sd <- sd(youngerGained)
youngerGained_ss <- length(youngerGained)

matureGained <- subset(ncbirths, mature == "mature mom" & !is.na(gained))$gained
matureGained_mean <- mean(matureGained)
matureGained_sd <- sd(matureGained)
matureGained_ss <- length(matureGained)

# Calculate our standard error
se <- sqrt((youngerGained_sd^2/youngerGained_ss)+(matureGained_sd^2/matureGained_ss))

# Calculate our point estimate and test statistic
pe <- youngerGained_mean - matureGained_mean
t <- pe / se

pt(t, df=min(c(youngerGained_ss, matureGained_ss))-1,lower.tail = FALSE)

## [1] 0.08553767

Conclusion
As our p-value (about 0.086) is greater than 0.05, we fail to reject the null hypothesis.
We do not have sufficient evidence to suggest that younger mothers gain more weight than mature mothers.
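For a one-tailed test the p-value uses only the tail named in the alternative hypothesis, so the sign of the test statistic matters and abs() is not applied. A minimal sketch with made-up weight gains (the vectors are assumptions for illustration) shows how the one-sided and two-sided p-values relate when the statistic points in the hypothesized direction.

```r
# Made-up weight gains in pounds, for illustration only
younger <- c(32, 28, 35, 40, 30, 27, 38, 33)
mature <- c(29, 25, 31, 27, 30, 26)

se <- sqrt(var(younger) / length(younger) + var(mature) / length(mature))
t_stat <- (mean(younger) - mean(mature)) / se
deg_free <- min(length(younger), length(mature)) - 1

# H_A: mu_y - mu_m > 0, so only the upper tail counts
p_one_sided <- pt(t_stat, df = deg_free, lower.tail = FALSE)

# Two-sided p-value for comparison
p_two_sided <- 2 * pt(abs(t_stat), df = deg_free, lower.tail = FALSE)

# Built-in Welch test with the same one-sided alternative
welch <- t.test(younger, mature, alternative = "greater")

print(c(t = t_stat, one_sided = p_one_sided, two_sided = p_two_sided))
print(unname(welch$statistic))
```

When t_stat is positive, the one-sided p-value is exactly half the two-sided one; when it is negative, the one-sided p-value exceeds 0.5 and the data cannot support the alternative.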

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

max(ncbirths[ncbirths$lowbirthweight == "low", ]$weight) # Print the largest number considered "low"

## [1] 5.5

min(ncbirths[ncbirths$lowbirthweight == "not low", ]$weight) # Print the smallest number considered "not low"

## [1] 5.56

It is reasonable to conclude that the cutoff between “low” and “not low” babies is 5.5lbs. This is the largest weight classified as “low”, and it sits just below the smallest weight classified as “not low” (5.56lbs), so the two classes do not overlap. A round value like 5.5 is also the kind of threshold a person would choose deliberately rather than one that arises by chance.
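The reasoning above depends on the two classes not overlapping, which can be checked directly. The sketch below uses a made-up data frame named births (an assumption for illustration) that mimics the weight and lowbirthweight columns.

```r
# Made-up data frame mimicking the two ncbirths columns used above
births <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.9, 6.8, 5.5, 8.0),
  lowbirthweight = c("low", "low", "not low", "not low",
                     "low", "not low", "low", "not low")
)

# Split the weights by class and look at the boundary values
groups <- split(births$weight, births$lowbirthweight)
largest_low <- max(groups[["low"]])
smallest_not_low <- min(groups[["not low"]])

# A single cutoff separates the classes only if they do not overlap
classes_separate <- largest_low < smallest_not_low

print(c(largest_low = largest_low, smallest_not_low = smallest_not_low))
print(classes_separate)
```

If classes_separate were FALSE, no single weight threshold could reproduce the labels, and the cutoff argument would not hold.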

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Is there a difference in average birth weight between the babies of mature and younger mothers?
Write hypotheses
\[H_0: \mu_y - \mu_m = 0\] \[H_A: \mu_y - \mu_m \neq 0\] Two-tailed test
Test by confidence interval or p-value and decision

# Extract our samples and the relevant data about them
youngerWeight <- subset(ncbirths, mature == "younger mom" & !is.na(weight))$weight
youngerWeight_mean <- mean(youngerWeight)
youngerWeight_sd <- sd(youngerWeight)
youngerWeight_ss <- length(youngerWeight)

matureWeight <- subset(ncbirths, mature == "mature mom" & !is.na(weight))$weight
matureWeight_mean <- mean(matureWeight)
matureWeight_sd <- sd(matureWeight)
matureWeight_ss <- length(matureWeight)

# Calculate our standard error
se <- sqrt((youngerWeight_sd^2/youngerWeight_ss)+(matureWeight_sd^2/matureWeight_ss))

# Calculate our point estimate and test statistic
pe <- youngerWeight_mean - matureWeight_mean
t <- pe / se

pt(abs(t), df=min(c(youngerWeight_ss, matureWeight_ss))-1,lower.tail = FALSE) * 2

## [1] 0.8528517

Conclusion
As our p-value (about 0.85) is far greater than 0.05, we fail to reject the null hypothesis.
We do not have sufficient evidence to suggest that babies of younger mothers weigh differently than babies of mature mothers.
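The same question can be answered with a confidence interval for the difference in means: a two-sided test at the 0.05 level fails to reject exactly when the 95% interval contains 0. The sketch below uses a hypothetical helper, diff_ci, and made-up birth weights, with the same conservative degrees of freedom as above.

```r
# Hypothetical helper: t-interval for mean(a) - mean(b), conservative df
diff_ci <- function(a, b, level = 0.95) {
  se <- sqrt(var(a) / length(a) + var(b) / length(b))
  deg_free <- min(length(a), length(b)) - 1
  t_crit <- qt(1 - (1 - level) / 2, df = deg_free)
  (mean(a) - mean(b)) + c(-1, 1) * t_crit * se
}

# Made-up birth weights (lbs) for illustration only
younger_w <- c(7.1, 6.8, 7.9, 7.4, 6.5, 8.0, 7.2, 6.9)
mature_w <- c(7.3, 6.6, 7.8, 7.0, 7.5, 6.9)

ci <- diff_ci(younger_w, mature_w, level = 0.95)
contains_zero <- ci[1] <= 0 && ci[2] >= 0

# Matching two-sided p-value with the same standard error and df
se <- sqrt(var(younger_w) / length(younger_w) + var(mature_w) / length(mature_w))
t_stat <- (mean(younger_w) - mean(mature_w)) / se
deg_free <- min(length(younger_w), length(mature_w)) - 1
p_value <- 2 * pt(abs(t_stat), df = deg_free, lower.tail = FALSE)

print(ci)
print(c(contains_zero = contains_zero, p_value = p_value))
```

The interval also reports the plausible size of the difference, which the p-value alone does not.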