In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Explore it by running str(), viewing the data frame, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is only for your own preparation.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mean(ncbirths$gained, na.rm = TRUE)
## [1] 30.3258
sd(ncbirths$gained, na.rm = TRUE)
## [1] 14.2413
sum(!is.na(ncbirths$gained))
## [1] 973
mn <- mean(ncbirths$gained, na.rm = TRUE)
sd <- sd(ncbirths$gained, na.rm = TRUE)
ss <- sum(!is.na(ncbirths$gained))
# Calculate t-critical value for 90% confidence
qt(0.05, lower.tail = FALSE, df = 972)
## [1] 1.646423
t <- qt(0.05, lower.tail = FALSE, df = 972)
# Calculate margin of error
t*(sd/sqrt(973))
## [1] 0.7516826
me <- t*(sd/sqrt(973))
# Standard error and boundaries of the 90% confidence interval
sd/sqrt(973)
## [1] 0.456555
se <- sd/sqrt(973)
mn - t * se
## [1] 29.57411
mn + t * se
## [1] 31.07748
We are 90% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.57 and 31.08 pounds.
# Calculate t-critical value for 95% confidence
qt(0.025, lower.tail = FALSE, df = 972)
## [1] 1.962408
t1 <- qt(0.025, lower.tail = FALSE, df = 972)
# Calculate margin of error
t1*(sd/sqrt(973))
## [1] 0.895947
me <- t1*(sd/sqrt(973))
# Boundaries of confidence interval
mn - t1 * se
## [1] 29.42985
mn + t1 * se
## [1] 31.22174
# Width of the 90% confidence interval
31.07-29.57
## [1] 1.5
# Width of the 95% confidence interval
31.22-29.43
## [1] 1.79
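The two intervals above repeat the same steps with a different confidence level, so they can be wrapped in one small function. The sketch below is only an illustration: `t_interval` is a hypothetical helper name, and the inputs are the rounded summary statistics printed earlier (mean 30.3258, standard deviation 14.2413, n = 973), so the last digits may differ slightly from the chunk output.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_interval <- function(xbar, s, n, level = 0.95) {
  # Upper-tail critical value for the requested confidence level
  t_crit <- qt((1 - level) / 2, df = n - 1, lower.tail = FALSE)
  # Margin of error = critical value * standard error
  me <- t_crit * s / sqrt(n)
  c(lower = xbar - me, upper = xbar + me, width = 2 * me)
}

# Rounded summary statistics reported above for ncbirths$gained
ci90 <- t_interval(30.3258, 14.2413, 973, level = 0.90)
ci95 <- t_interval(30.3258, 14.2413, 973, level = 0.95)
ci90
ci95
```

The 95% interval is wider than the 90% interval because a higher confidence level requires a larger critical value.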
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\(H_0: \mu = 7.7\)
\(H_A: \mu \neq 7.7\)
Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
(mn_b <- mean(ncbirths$weight))
## [1] 7.101
(sd_b <- sd(ncbirths$weight))
## [1] 1.50886
table(!is.na(ncbirths$weight))
##
## TRUE
## 1000
# Test statistic
(ts <- (mn_b-7.7) / (sd_b/sqrt(1000)))
## [1] -12.55388
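# Two-sided p-value: probability of a test statistic at least this extreme if H0 is true
pt(-abs(ts), df = 999)*2
## [1] 1.135415e-33
Decision
p < \(\alpha\), therefore we reject \(H_0\).
Conclusion
There is sufficient evidence to support the claim that the average birthweight of NC babies is different from that of European babies.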
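The one-sample test above can also be reproduced from its summary statistics alone. This is a minimal sketch, assuming the rounded values printed above (mean 7.101, standard deviation 1.50886, n = 1000). `one_sample_t` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics
one_sample_t <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)                     # standard error of the mean
  t_stat <- (xbar - mu0) / se           # test statistic
  # -abs() makes the two-sided p-value valid for either sign of t_stat
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)
  list(statistic = t_stat, df = n - 1, p.value = p_value)
}

res <- one_sample_t(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
res$statistic
res$p.value < 0.05   # TRUE means H0 is rejected at the 0.05 level
```

Using `-abs(t_stat)` avoids a p-value greater than 1 when the test statistic turns out to be positive.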
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
\(\mu_{diff}\) = mean difference in age (father minus mother)
\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \neq 0\)
Test by confidence interval or p-value and decision
# Parent Dataframe
parents <- data.frame(ncbirths$fage, ncbirths$mage)
# Create a diff column holding each father-minus-mother age difference
parents$diff <- parents$ncbirths.fage - parents$ncbirths.mage
# Calculate the test statistic
ts1 <- (mean(parents$diff, na.rm = TRUE)-0) / (sd(parents$diff, na.rm = TRUE)/sqrt(1000))
# Probability of getting that test statistic
pt(ts1, df = 999, lower.tail = FALSE)*2
## [1] 1.840249e-71
Decision
p < \(\alpha\), therefore we reject \(H_0\).
Conclusion
There is sufficient evidence to support the claim that there is a significant difference between the mean age of mothers and fathers.
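A paired test should count only the pairs where both ages are recorded, so any missing values in the difference column reduce the sample size and the degrees of freedom. The sketch below uses small made-up vectors (`fage_demo` and `mage_demo` are hypothetical, not taken from ncbirths) to show the count of complete pairs and a cross-check against `t.test()`.

```r
# Hypothetical ages with missing father ages, for illustration only
fage_demo <- c(31, NA, 28, 40, 35, NA, 24, 33)
mage_demo <- c(29, 25, 27, 36, 30, 22, 25, 30)

# Paired differences; NA wherever either age is missing
d <- fage_demo - mage_demo

# Sample size is the number of complete pairs, not the number of rows
n_pairs <- sum(!is.na(d))

# Manual paired t statistic and two-sided p-value
t_stat <- mean(d, na.rm = TRUE) / (sd(d, na.rm = TRUE) / sqrt(n_pairs))
p_val <- 2 * pt(-abs(t_stat), df = n_pairs - 1)

# Cross-check: a one-sample t-test on the differences drops NAs itself
check <- t.test(d, mu = 0)
c(manual = p_val, t.test = check$p.value)
```

If the manual p-value and the `t.test()` p-value disagree on real data, the most likely cause is a hard-coded sample size that includes rows with missing values.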
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Two-tailed test
# Create smoker and non-smoker subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# Calculate the mean difference between smokers and nonsmokers in weeks
md <- mean(smokers$weeks, na.rm = TRUE) - mean(nonsmokers$weeks, na.rm = TRUE)
# Determine the sample sizes of pregnant mom smokers and nonsmokers
table(!is.na(smokers$weeks))
##
## TRUE
## 126
table(!is.na(nonsmokers$weeks))
##
## FALSE TRUE
## 1 872
# Find the standard error of the weeks variable
se_weeks <- sqrt((sd(smokers$weeks, na.rm = TRUE)^2/126) + (sd(nonsmokers$weeks, na.rm = TRUE)^2/872))
# Test statistic
ts2 <- (md - 0)/se_weeks
# Calculate the two-sided p-value (conservative df = smaller sample size - 1)
pt(-abs(ts2), df = 125)*2
## [1] 0.6046811
Decision
p > \(\alpha\), therefore we fail to reject \(H_0\).
Conclusion
There is not sufficient evidence to support the claim that there is a significant difference in length of pregnancy between smokers and non-smokers.
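The manual calculation above uses the conservative degrees of freedom, the smaller sample size minus one. As a rough cross-check, `t.test()` runs the Welch version of the same test by default: the test statistic is identical, but the degrees of freedom are estimated from the data. The sketch below uses simulated values (the vectors, means, and standard deviations are made up for illustration, not drawn from ncbirths).

```r
# Simulated pregnancy lengths in weeks, for illustration only
set.seed(1)
smoker_weeks <- rnorm(30, mean = 38.4, sd = 3)
nonsmoker_weeks <- rnorm(80, mean = 38.3, sd = 3)

n1 <- length(smoker_weeks)
n2 <- length(nonsmoker_weeks)

# Standard error of the difference in means (unpooled variances)
se <- sqrt(var(smoker_weeks) / n1 + var(nonsmoker_weeks) / n2)
t_stat <- (mean(smoker_weeks) - mean(nonsmoker_weeks)) / se

# Conservative two-sided p-value with df = min(n1, n2) - 1
p_conservative <- 2 * pt(-abs(t_stat), df = min(n1, n2) - 1)

# Welch two-sample t-test (var.equal = FALSE is the default)
welch <- t.test(smoker_weeks, nonsmoker_weeks)
c(conservative = p_conservative, welch = welch$p.value)
```

Because the Welch degrees of freedom are never smaller than the conservative ones, the conservative p-value is never smaller than the Welch p-value.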
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
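\(\mu_1\) = mean weight gained by younger mothers
\(\mu_2\) = mean weight gained by mature mothers
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 > 0\)
One-tailed (right-tailed) test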
# Create mature and younger mom subsets
ym <- subset(ncbirths, ncbirths$mature == "younger mom")
mm <- subset(ncbirths, ncbirths$mature == "mature mom")
# Calculate the mean difference in weights
md_weight <- mean(mm$weight) - mean(ym$weight)
# Determine the sample sizes
table(!is.na(mm$weight))
##
## TRUE
## 133
table(!is.na(ym$weight))
##
## TRUE
## 867
# Find the standard error
se_weight <- sqrt((sd(mm$weight)^2/133)+(sd(ym$weight)^2/867))
# Test statistic
ts_weight <- (md_weight-0)/se_weight
# Calculate the p-value
pt(-ts_weight, df=132)*2
## [1] 0.8528517
Decision
p > \(\alpha\), therefore we fail to reject \(H_0\).
Conclusion
The average weight gained by younger mothers is not significantly more than the average weight gained by mature mothers.
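Because the question asks whether younger mothers gain more weight, a one-sided test on the weight-gained variable is the natural fit. The sketch below shows the shape of that calculation on simulated data. The vectors are made up for illustration, and treating `gained` as the variable of interest is an assumption based on the wording of the question.

```r
# Simulated weight gained (lbs), for illustration only
set.seed(2)
younger_gained <- rnorm(60, mean = 30.5, sd = 14)
mature_gained <- rnorm(25, mean = 28.8, sd = 13)

n_y <- length(younger_gained)
n_m <- length(mature_gained)

# Difference is younger minus mature, matching H_A: mu_younger - mu_mature > 0
se <- sqrt(var(younger_gained) / n_y + var(mature_gained) / n_m)
t_stat <- (mean(younger_gained) - mean(mature_gained)) / se

# Right-tailed p-value: no doubling, and only the upper tail counts
df_cons <- min(n_y, n_m) - 1
p_one_sided <- pt(t_stat, df = df_cons, lower.tail = FALSE)

# Cross-check of the statistic with t.test(); its p-value uses Welch df
check <- t.test(younger_gained, mature_gained, alternative = "greater")
c(manual_t = t_stat, t.test_t = unname(check$statistic))
p_one_sided
```

With real data that contain missing values, each mean, variance, and sample size would need to be computed on the non-missing observations only.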
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# Create "low" and "not low" subsets
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
# Determine the max and min values using the fivenum function
fivenum(low$weight)
## [1] 1.000 3.095 4.560 5.160 5.500
fivenum(not_low$weight)
## [1] 5.56 6.75 7.44 8.13 11.75
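Splitting the data by the lowbirthweight variable and comparing the two five-number summaries shows that the heaviest “low” baby weighs 5.50 lbs and the lightest “not low” baby weighs 5.56 lbs. These data suggest that a “low” birth weight is any value at or below 5.50 lbs and a “not low” birth weight is any value at or above 5.56 lbs, so the cutoff lies between 5.50 and 5.56 lbs.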
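The same comparison can be done directly with the maximum of one group and the minimum of the other. The sketch below uses a tiny made-up data frame (`births_demo` is hypothetical) with the same two column names, only to show the idea.

```r
# Tiny made-up data frame with the same column names, for illustration only
births_demo <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.0, 8.4),
  lowbirthweight = c("low", "low", "not low", "not low", "low", "not low")
)

# Heaviest baby labelled "low" and lightest baby labelled "not low"
max_low <- max(births_demo$weight[births_demo$lowbirthweight == "low"])
min_not_low <- min(births_demo$weight[births_demo$lowbirthweight == "not low"])

# The cutoff must lie between these two values
c(max_low = max_low, min_not_low = min_not_low)

# If the groups are cleanly separated, the ranges do not overlap
max_low < min_not_low
```

If the final comparison were FALSE, the two groups would overlap and no single weight cutoff could explain the labels.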
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between married mothers and not married mothers.
Write hypotheses
\(\mu_1\) = mean length of pregnancy (weeks) for married mothers
\(\mu_2\) = mean length of pregnancy (weeks) for not married mothers
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Two-tailed test
# Create married and not married subsets
married <- subset(ncbirths, ncbirths$marital == "married")
not_married <- subset(ncbirths, ncbirths$marital == "not married")
# Calculate the mean difference between married and not married mothers
md_marital <- mean(married$weeks, na.rm = TRUE) - mean(not_married$weeks, na.rm = TRUE)
# Determine the sample sizes of married and not married mothers
table(!is.na(married$weeks))
##
## TRUE
## 386
table(!is.na(not_married$weeks))
##
## FALSE TRUE
## 1 612
# Find the standard error of the weeks variable
se_w <- sqrt((sd(married$weeks, na.rm = TRUE)^2/386) + (sd(not_married$weeks, na.rm = TRUE)^2/612))
# Test statistic
ts_w <- (md_marital - 0)/se_w
# Calculate the p-value
pt(-abs(ts_w), df = 385)*2
## [1] 0.5358574
Decision
p > \(\alpha\), therefore we fail to reject \(H_0\).
Conclusion
There is not sufficient evidence to support the claim that there is a significant difference in length of pregnancy between married mothers and not married mothers.
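The research question can also be answered with a confidence interval for the difference in means: at the 0.05 level, the two-sided test fails to reject the null hypothesis exactly when the 95% interval contains 0. The sketch below illustrates this on simulated data. The vectors are made up, not taken from ncbirths, and the conservative degrees of freedom are used as in the tests above.

```r
# Simulated pregnancy lengths in weeks, for illustration only
set.seed(3)
married_weeks <- rnorm(40, mean = 38.5, sd = 2.8)
not_married_weeks <- rnorm(60, mean = 38.2, sd = 3.0)

# Point estimate and standard error of the difference in means
diff_means <- mean(married_weeks) - mean(not_married_weeks)
se <- sqrt(var(married_weeks) / length(married_weeks) +
           var(not_married_weeks) / length(not_married_weeks))

# Conservative degrees of freedom: smaller sample size minus one
df_cons <- min(length(married_weeks), length(not_married_weeks)) - 1

# 95% confidence interval for mu_1 - mu_2
t_crit <- qt(0.975, df = df_cons)
ci <- c(lower = diff_means - t_crit * se, upper = diff_means + t_crit * se)

# Two-sided p-value using the same standard error and df
p_val <- 2 * pt(-abs(diff_means / se), df = df_cons)

# The interval contains 0 exactly when p >= 0.05
contains_zero <- ci[["lower"]] <= 0 && ci[["upper"]] >= 0
ci
c(p_value = p_val, contains_zero = contains_zero)
```

Reporting the interval alongside the p-value also shows how large the difference could plausibly be, which the p-value alone does not.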