Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mn <- mean(ncbirths$gained, na.rm = TRUE)
stdev <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
The mean is 30.33, the standard deviation is 14.24, and the sample size is 973.
# Calculate t-critical value for 90% confidence
t <- abs(qt(p = 0.05, df = 972))
The t-critical value is 1.65.
# Calculate standard error, then margin of error
se <- stdev/sqrt(973)
me <- t*se
The margin of error is 0.75.
# Boundaries of confidence interval
mn - t * se
## [1] 29.57411
mn + t * se
## [1] 31.07748
We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 pounds and 31.08 pounds.
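The steps above can be collected into one small function. This is only a sketch: `t_ci` is a hypothetical helper name (not from the assignment), and it is fed the rounded summary statistics reported above, so its result can differ from the chunk output in the second decimal place.

```r
# Hypothetical helper: t-confidence interval from summary statistics.
# xbar = sample mean, s = sample standard deviation, n = sample size.
t_ci <- function(xbar, s, n, conf_level = 0.90) {
  alpha  <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # two-sided critical value
  se     <- s / sqrt(n)                    # standard error of the mean
  c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)
}

# Rounded values reported above: mean 30.33, sd 14.24, n 973
ci90 <- t_ci(xbar = 30.33, s = 14.24, n = 973, conf_level = 0.90)
print(round(ci90, 2))
```

Passing `conf_level = 0.95` to the same function gives the wider interval used in the next part.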
# The mean, standard deviation, sample size and standard error remain the same. The t-critical value and the boundaries change.
# T-critical value
t5 <- abs(qt(p = 0.025, df = 972))
# Boundaries
mn - t5 * se
## [1] 29.42985
mn + t5 * se
## [1] 31.22174
We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 pounds and 31.22 pounds.
# 90% confidence interval difference
31.08-29.57
## [1] 1.51
# 95% confidence interval difference
31.22-29.43
## [1] 1.79
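The 95% confidence interval is wider than the 90% interval. The mean and standard error are the same in both, but a higher confidence level requires a larger t-critical value, so the margin of error grows. To be more confident that the interval captures the true mean, we have to accept a wider range of plausible values.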
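The relationship between confidence level and width can be checked directly, since the width is just 2 × t-critical × standard error. The sketch below reuses the rounded standard deviation and sample size from above; the 99% level is an extra row added only for illustration and is not part of the assignment.

```r
# Width of a t-interval at several confidence levels (same data, same SE).
se_gain     <- 14.24 / sqrt(973)      # rounded sd and n reported above
dof         <- 973 - 1
conf_levels <- c(0.90, 0.95, 0.99)    # 0.99 added for illustration only

t_crit <- qt(1 - (1 - conf_levels) / 2, df = dof)
widths <- 2 * t_crit * se_gain

print(data.frame(conf_level = conf_levels,
                 t_crit     = round(t_crit, 3),
                 width      = round(widths, 2)))
```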
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\)
Two-tailed
# Sample statistics (sample mean, standard deviation, and size)
mnB <- mean(ncbirths$weight)
stdevB <- sd(ncbirths$weight)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
The mean is 7.1 pounds, standard deviation is 1.5 pounds, and sample size is 1,000 babies.
# Test statistic
(mnB-7.7) / (stdevB/sqrt(1000))
## [1] -12.55388
# Probability of test statistic by chance
pt(-12.55388, df = 999)*2
## [1] 1.135354e-33
The p-value is 1.135e-33, which is essentially zero.
Since the p-value is far below any conventional significance level (such as 0.05), we reject the null hypothesis in favor of the alternative hypothesis. In other words, the data provide strong evidence that the average birthweight of North Carolina babies differs from the European average of 7.7 lbs; the sample mean of 7.1 lbs is lower.
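For reference, the same test can be written as a function of the summary statistics. `one_sample_t` is a hypothetical helper, and because it is given the rounded mean and standard deviation (7.1 and 1.5), its test statistic will be close to, but not exactly, the -12.55 computed from the raw data.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics.
one_sample_t <- function(xbar, s, n, mu0) {
  t_stat  <- (xbar - mu0) / (s / sqrt(n))
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # both tails
  list(statistic = t_stat, df = n - 1, p_value = p_value)
}

res <- one_sample_t(xbar = 7.1, s = 1.5, n = 1000, mu0 = 7.7)
print(res)

# On the raw data, the built-in equivalent would be:
# t.test(ncbirths$weight, mu = 7.7)
```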
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0: \mu_{diff} = 0\) \(H_A: \mu_{diff} \neq 0\), where \(\mu_{diff}\) is the mean of father's age minus mother's age. Two-tailed test
# column of differences
ncbirths$diff <- (ncbirths$fage) - (ncbirths$mage)
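# test statistic
# (sqrt(1000) assumes no missing differences; if diff contains NAs,
#  the sample size should be sum(!is.na(ncbirths$diff)) instead)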
(mean(ncbirths$diff, na.rm = TRUE)-0) / (sd(ncbirths$diff, na.rm = TRUE)/sqrt(1000))
## [1] 19.41001
# probability of getting that test statistic
pt(19.41, df = 999, lower.tail = FALSE)*2
## [1] 1.840431e-71
Since our p-value is much smaller than alpha, we will reject the null hypothesis in favor of the alternative hypothesis. In other words, the data suggests there is a significant difference between the mean age of mothers and fathers in the ncbirths dataset.
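One thing to watch in the calculation above is that `na.rm = TRUE` drops missing differences from the mean and standard deviation, while the sample size is fixed at 1000. A sketch of a paired test that counts only the complete pairs is below. `paired_t` is a hypothetical helper and the ages are made-up toy values, not rows from ncbirths.

```r
# Hypothetical helper: paired t-test that uses only complete pairs.
paired_t <- function(x, y) {
  d <- x - y
  d <- d[!is.na(d)]            # keep pairs where both values are present
  n <- length(d)               # n is the number of usable differences
  t_stat  <- mean(d) / (sd(d) / sqrt(n))
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)
  list(n = n, statistic = t_stat, p_value = p_value)
}

# Toy ages (made up for illustration), with one NA on each side
father <- c(30, 25, NA, 40, 35, 29)
mother <- c(28, 24, 31, NA, 30, 27)

manual  <- paired_t(father, mother)
builtin <- t.test(father, mother, paired = TRUE)  # also drops incomplete pairs

print(manual)
print(builtin$statistic)
```

On the real data the call would be `paired_t(ncbirths$fage, ncbirths$mage)`; if any ages are missing, the statistic will be smaller in magnitude than the one computed with n = 1000.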
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed
# subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# mean difference in weeks
meandiffW <- mean(smokers$weeks, na.rm = TRUE) - mean(nonsmokers$weeks, na.rm = TRUE)
# sample sizes
table(is.na(smokers$weeks))
##
## FALSE
## 126
table(is.na(nonsmokers$weeks))
##
## FALSE TRUE
## 872 1
# standard error of weeks
SEW <- sqrt((sd(smokers$weeks, na.rm = TRUE)^2/126)+(sd(nonsmokers$weeks, na.rm = TRUE)^2/872))
# test statistic
(meandiffW-0) / SEW
## [1] 0.5189962
# p-value
pt(-0.5189962, df = 125)*2
## [1] 0.604681
Since the p-value (0.60) is greater than alpha, we fail to reject the null hypothesis. In other words, the data do not provide sufficient evidence of a difference in the mean length of pregnancy between smokers and nonsmokers in North Carolina.
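The p-value above uses the conservative degrees of freedom, min(n1, n2) - 1 = 125. The sketch below shows that calculation next to R's built-in `t.test`, which uses the Welch approximation for the degrees of freedom instead. `two_sample_t` is a hypothetical helper and the two groups are simulated, so the numbers are not the ncbirths results.

```r
# Hypothetical helper: unpooled two-sample t statistic with the
# conservative df = min(n1, n2) - 1, as in the hand calculation above.
two_sample_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  se      <- sqrt(var(x) / length(x) + var(y) / length(y))
  t_stat  <- (mean(x) - mean(y)) / se
  df_cons <- min(length(x), length(y)) - 1
  list(statistic = t_stat,
       df        = df_cons,
       p_value   = 2 * pt(-abs(t_stat), df = df_cons))
}

set.seed(1)
grp_a <- rnorm(30, mean = 38.3, sd = 2.5)        # simulated, illustration only
grp_b <- c(rnorm(60, mean = 38.4, sd = 3), NA)   # includes one missing value

manual  <- two_sample_t(grp_a, grp_b)
builtin <- t.test(grp_a, grp_b)   # Welch test by default (var.equal = FALSE)

print(manual)
print(builtin)
```

The test statistic is identical in both approaches. Only the degrees of freedom differ, and the conservative choice never gives a smaller p-value than Welch's.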
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed
# subsets
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
# Mean difference in weights
meandiffM <- mean(maturemom$weight) - mean(youngermom$weight)
# sample size
table(is.na(maturemom$weight))
##
## FALSE
## 133
table(is.na(youngermom$weight))
##
## FALSE
## 867
# Standard Error for Maturity
seM <- sqrt((sd(maturemom$weight)^2/133)+(sd(youngermom$weight)^2/867))
# test statistic
(meandiffM-0)/seM
## [1] 0.1858449
# p-value
pt(-0.1858449, df=132)*2
## [1] 0.8528517
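Since the p-value (0.85) is greater than alpha = 0.05, we fail to reject the null hypothesis. Note that the code above compares the `weight` column (the baby's birth weight) rather than `gained` (the mother's weight gain), and does so with a two-tailed test. What it shows is that there is no significant difference in mean birth weight between babies of mature and younger mothers. Answering the question as posed requires a one-sided test on `gained`.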
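A sketch of the test the question asks for, with \(H_0: \mu_{younger} - \mu_{mature} = 0\) against \(H_A: \mu_{younger} - \mu_{mature} > 0\), is below. `one_sided_gain_test` is a hypothetical helper, the toy vectors are made up, and `t.test` uses the Welch degrees of freedom rather than min(n1, n2) - 1, so this is an illustration of the setup and not a reported result.

```r
# Hypothetical helper: one-sided test that the first group's mean is larger.
one_sided_gain_test <- function(younger, mature, alpha = 0.05) {
  res <- t.test(younger, mature, alternative = "greater")  # NAs are dropped
  list(statistic = unname(res$statistic),
       p_value   = res$p.value,
       reject    = res$p.value < alpha)
}

# With the subsets defined above, the call would be:
# one_sided_gain_test(youngermom$gained, maturemom$gained)

# Toy weight gains in pounds (made up for illustration)
younger <- c(32, 28, 35, NA, 30, 41, 27, 33)
mature  <- c(29, 31, 26, 34, 28, NA, 30)

out <- one_sided_gain_test(younger, mature)
print(out)
```

Passing the two groups as separate vectors, instead of a formula, makes the direction of the alternative explicit and avoids depending on the order of the factor levels in `mature`.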
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# make subsets
small <- subset(ncbirths, ncbirths$lowbirthweight == "low")
big <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
# look at the max and min values
summary(small$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(big$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
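For this, I split the data by the lowbirthweight label and compared the largest weight labeled “low” with the smallest weight labeled “not low”. Since every baby falls into exactly one group and the two ranges do not overlap, the cutoff must lie between those two values.
Babies weighing 5.50 pounds or less are labeled “low”, and babies weighing 5.56 pounds or more are labeled “not low”, so the cutoff is somewhere between 5.50 and 5.56 pounds. This is consistent with the standard clinical definition of low birth weight, under 2,500 g (about 5.51 lb).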
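The same comparison can be done in one step by computing the range of weights within each label. `weight_ranges` is a hypothetical helper, and the toy data frame only reuses the minimum, median, and maximum values from the two summaries above, so it stands in for the full dataset.

```r
# Hypothetical helper: min and max weight within each lowbirthweight label.
weight_ranges <- function(df) {
  rng <- sapply(split(df$weight, df$lowbirthweight), range)
  rownames(rng) <- c("min", "max")
  rng
}

# With the real data, the call would be: weight_ranges(ncbirths)

# Toy data built from the summary values shown above
toy <- data.frame(
  weight         = c(1.00, 4.56, 5.50, 5.56, 7.44, 11.75),
  lowbirthweight = c("low", "low", "low", "not low", "not low", "not low")
)

rng <- weight_ranges(toy)
print(rng)

# The cutoff lies between the largest "low" and the smallest "not low" weight
cutoff_bounds <- c(rng["max", "low"], rng["min", "not low"])
print(cutoff_bounds)
```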
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Determine if younger moms tend to visit the hospital more during pregnancy than mature moms.
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed
# sample sizes
table(is.na(maturemom$visits))
##
## FALSE TRUE
## 131 2
table(is.na(youngermom$visits))
##
## FALSE TRUE
## 860 7
# standard error of visits
seV <- sqrt((sd(maturemom$visits, na.rm = TRUE)^2/131)+(sd(youngermom$visits, na.rm = TRUE)^2/860))
# mean difference in visits
meandiffV <- mean(maturemom$visits, na.rm = TRUE) - mean(youngermom$visits, na.rm = TRUE)
# test statistic
(meandiffV-0)/seV
## [1] 1.439373
# p-value
pt(-1.439, df=130)*2
## [1] 0.1525539
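The p-value (0.15) is greater than alpha, so we fail to reject the null hypothesis. In other words, the data do not provide sufficient evidence of a difference in the mean number of hospital visits during pregnancy between younger and mature moms. The test statistic is positive (mature minus younger), so in this sample mature moms actually averaged slightly more visits, which gives no support to the idea that younger moms visit more.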