In this project, students will demonstrate their understanding of inference for numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the
following R chunk. Explore it by running the str() function,
viewing the data frame, and reading its documentation to
familiarize yourself with the variables. None of this will be
graded; it is for your own preparation.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of the gained variable
# (names chosen so they do not mask the built-in mean(), sd(), and t() functions)
gained_mean <- mean(ncbirths$gained, na.rm = TRUE)
gained_sd <- sd(ncbirths$gained, na.rm = TRUE)
n <- sum(!is.na(ncbirths$gained))
# Calculate t-critical value for 90% confidence (5% in each tail)
t_crit <- abs(qt(p = .05, df = n - 1))
# Calculate margin of error
me <- t_crit * gained_sd/sqrt(n)
# Boundaries of confidence interval
gained_mean - me #lower bound
## [1] 29.57411
gained_mean + me #upper bound
## [1] 31.07748
We can be 90% confident that the true mean weight gained during pregnancy by North Carolina mothers is between 29.6 and 31.1 pounds.
# Shortcut using the t.test function (conf.level must match the requested 90%)
t.test(ncbirths$gained, conf.level = .90)
##
## One Sample t-test
##
## data: ncbirths$gained
## t = 66.423, df = 972, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 29.57411 31.07748
## sample estimates:
## mean of x
## 30.3258
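The manual steps above can be wrapped in a small helper so that the confidence level becomes a parameter. The sketch below is only an illustration: `t_ci` is a hypothetical helper name, and the vector `x` holds made-up values rather than `ncbirths` data. It checks that the manual interval agrees with the one `t.test()` reports at the same confidence level.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf = 0.90) {
  x <- x[!is.na(x)]
  n <- length(x)
  # For a two-sided interval, (1 - conf)/2 goes in each tail
  t_crit <- qt(p = 1 - (1 - conf) / 2, df = n - 1)
  me <- t_crit * sd(x) / sqrt(n)
  c(lower = mean(x) - me, upper = mean(x) + me)
}

# Made-up illustrative values (not from ncbirths)
x <- c(28, 35, NA, 22, 40, 31, 27, 33, 19, 36)
manual <- t_ci(x, conf = 0.90)
builtin <- t.test(x, conf.level = 0.90)$conf.int

print(manual)
print(all.equal(unname(manual), as.numeric(builtin)))
```

Passing the same `conf` value to both the helper and `t.test()` is what keeps the two results comparable.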
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\[H_0: \mu = 7.7 \text{ lbs}\] \[H_A: \mu \neq 7.7 \text{ lbs}\]
Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
mean2 <- mean(ncbirths$weight, na.rm = TRUE)
sd2 <- sd(ncbirths$weight, na.rm = TRUE)
n2 <- sum(!is.na(ncbirths$weight))
# Test statistic
ts2 <- abs((mean2 - 7.7)/(sd2/sqrt(n2)))
# Two-sided p-value: probability of a test statistic at least this extreme if H_0 is true
pt(q = ts2, df = n2 - 1, lower.tail = FALSE)*2
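## [1] 1.135415e-33
Because the p-value (about 1.1e-33) is far below 0.05, we reject the null hypothesis. The data provide strong evidence that the average birthweight of NC babies differs from 7.7 lbs.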
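As a cross-check, the same kind of test can be run with `t.test()` by passing the null value through its `mu` argument. The sketch below uses made-up birthweights (the vector `w` is not taken from `ncbirths`) and compares the manual two-sided p-value with the one `t.test()` returns.

```r
# Made-up illustrative birthweights in pounds (not from ncbirths)
w <- c(7.1, 6.8, 7.9, 6.2, 7.4, 8.1, 6.9, 7.0, 5.9, 7.3)
mu0 <- 7.7  # null value from the hypotheses

# Manual two-sided p-value, following the same steps as the chunk above
ts <- (mean(w) - mu0) / (sd(w) / sqrt(length(w)))
p_manual <- 2 * pt(abs(ts), df = length(w) - 1, lower.tail = FALSE)

# Same test with t.test(); mu sets the null value
res <- t.test(w, mu = mu0, alternative = "two.sided")

print(c(manual = p_manual, t.test = res$p.value))
```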
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
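\[H_0: \mu_{diff} = 0\] \[H_A: \mu_{diff} \neq 0\]
Here \(\mu_{diff}\) is the mean difference in age within a couple (father's age minus mother's age).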
Test by confidence interval or p-value and decision
#Ensure ncbirths dataset is in our environment
ncbirths <- ncbirths
#Create a column in the ncbirths dataset that contains the age difference for each couple (father's age minus mother's age)
ncbirths$diff <- ncbirths$fage - ncbirths$mage
#Find the mean, standard deviation, and sample size of the difference variable
diff_mean <- mean(ncbirths$diff, na.rm = TRUE)
diff_sd <- sd(ncbirths$diff, na.rm = TRUE)
n3 <- sum(!is.na(ncbirths$diff))
#Find the 99%, two-tailed critical value
t2 <- abs(qt(p = .005, df = n3 - 1))
#Find the lower and upper bounds of the t-confidence interval
diff_mean - t2*(diff_sd/sqrt(n3))
## [1] 2.26508
diff_mean + t2*(diff_sd/sqrt(n3))
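## [1] 3.040107
The 99% confidence interval (2.27, 3.04) does not contain 0, so we reject the null hypothesis at the 0.01 level and therefore also at the 0.05 level. On average, fathers are between about 2.3 and 3.0 years older than mothers.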
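Working with a difference column is equivalent to a paired t-test on the two original columns. The sketch below illustrates this with made-up ages (the vectors `fage` and `mage` here are toy values, not the `ncbirths` columns); a missing father's age drops that couple from both calculations.

```r
# Made-up illustrative ages (not from ncbirths); NA marks a missing father's age
fage <- c(31, 28, NA, 40, 25, 33, 29, 36)
mage <- c(29, 27, 22, 35, 26, 30, 24, 33)

# One-sample t-test on the differences (father minus mother)
d <- fage - mage
one_sample <- t.test(d, mu = 0, conf.level = 0.99)

# Paired t-test on the two vectors
paired <- t.test(fage, mage, paired = TRUE, conf.level = 0.99)

print(one_sample$conf.int)
print(all.equal(as.numeric(one_sample$conf.int), as.numeric(paired$conf.int)))
```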
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
#Store smokers and non-smokers into their own subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Store mean, standard deviation, and sample size of the length of pregnancy variable for the smokers and nonsmokers subset
smean <- mean(smokers$weeks, na.rm = TRUE)
ssd <- sd(smokers$weeks, na.rm = TRUE)
n4 <- sum(!is.na(smokers$weeks))
nmean <- mean(nonsmokers$weeks, na.rm = TRUE)
nsd <- sd(nonsmokers$weeks, na.rm = TRUE)
n5 <- sum(!is.na(nonsmokers$weeks))
#Store standard error
se2 <- sqrt((ssd^2/n4)+(nsd^2/n5))
#Test statistic
ts4 <- (smean - nmean)/se2
#Find the two-sided p-value, using the conservative df (smaller sample size minus 1)
pt(abs(ts4), df = min(n4, n5) - 1, lower.tail = FALSE)*2
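## [1] 0.6046811
Because the p-value (0.605) is greater than 0.05, we fail to reject the null hypothesis of equal means. The data do not provide convincing evidence of a difference in mean length of pregnancy between smokers and non-smokers.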
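A common conservative choice for the degrees of freedom in a two-sample test is the smaller sample size minus 1, whereas `t.test()` defaults to the Welch-Satterthwaite approximation. The sketch below compares the two on made-up data (the vectors `a` and `b` are illustrative, not the smoker and non-smoker subsets).

```r
# Made-up illustrative pregnancy lengths in weeks (not from ncbirths)
a <- c(39, 40, 37, 38, 41, 36, 39, 40)
b <- c(38, 39, 40, 41, 37, 39, 40, 38, 39, 42, 36, 40)

# Squared standard error contributed by each group
v_a <- var(a) / length(a)
v_b <- var(b) / length(b)
ts <- (mean(a) - mean(b)) / sqrt(v_a + v_b)

# Conservative df: smaller sample size minus 1
df_cons <- min(length(a), length(b)) - 1
# Welch-Satterthwaite approximation, the default in t.test()
df_welch <- (v_a + v_b)^2 /
  (v_a^2 / (length(a) - 1) + v_b^2 / (length(b) - 1))

p_cons <- 2 * pt(abs(ts), df = df_cons, lower.tail = FALSE)
p_welch <- 2 * pt(abs(ts), df = df_welch, lower.tail = FALSE)

print(c(df_cons = df_cons, df_welch = df_welch))
print(c(p_cons = p_cons, p_welch = p_welch, t.test = t.test(a, b)$p.value))
```

The conservative df is never larger than the Welch df, so its p-value is never smaller; with large samples the two usually lead to the same decision.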
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
#Subset both the young and mature mothers
young <- subset(ncbirths, ncbirths$mature == "younger mom")
mature <- subset(ncbirths, ncbirths$mature == "mature mom")
#Perform t-test for p-value
t.test(x = young$gained, y = mature$gained, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: young$gained and mature$gained
## t = 1.3765, df = 175.34, p-value = 0.08521
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.3562741 Inf
## sample estimates:
## mean of x mean of y
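## 30.56043 28.79070
Because the p-value (0.085) is greater than 0.05, we fail to reject the null hypothesis. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.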
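For a one-sided test, the direction of `alternative` depends on which group comes first. The sketch below uses a made-up data frame (`toy` is illustrative, not `ncbirths`) to show that the formula interface orders groups by factor level, so the same hypothesis may need `"less"` instead of `"greater"`.

```r
# Made-up illustrative data (not from ncbirths)
toy <- data.frame(
  gained = c(30, 35, 28, 40, 33, 25, 29, 27, 31, 24),
  mature = factor(rep(c("younger mom", "mature mom"), each = 5))
)
print(levels(toy$mature))  # "mature mom" sorts first alphabetically

young_moms <- subset(toy, mature == "younger mom")
mature_moms <- subset(toy, mature == "mature mom")

# Vector interface: tests mean(young) - mean(mature) > 0
p_xy <- t.test(young_moms$gained, mature_moms$gained,
               alternative = "greater")$p.value

# Formula interface: tests mean(first level) - mean(second level),
# i.e. mature minus younger, so the same hypothesis needs "less"
p_formula <- t.test(gained ~ mature, data = toy,
                    alternative = "less")$p.value

print(all.equal(p_xy, p_formula))
```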
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
#Store low and not low birthweight subsets
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
#Find maximum weight for the low subset and minimum weight for the not low subset
summary(low$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(not_low$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
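Every baby classified as “low” weighs at most 5.5 lbs, and every baby classified as “not low” weighs at least 5.56 lbs, so the cutoff lies between 5.5 and 5.56 lbs. I found this by splitting the data into “low” and “not low” subsets and comparing the maximum weight of the first with the minimum weight of the second. Because the two ranges do not overlap, the classification is consistent with a single weight threshold.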
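The same comparison can be done without creating separate subsets by splitting the weights by class. The sketch below uses a made-up data frame (`toy` is illustrative, not `ncbirths`) and also checks that the two classes do not overlap, which is what justifies reading a single cutoff from the data.

```r
# Made-up illustrative data (not from ncbirths)
toy <- data.frame(
  weight = c(4.2, 5.5, 3.1, 5.56, 7.4, 8.1, 6.0, 5.0),
  lowbirthweight = c("low", "low", "low", "not low",
                     "not low", "not low", "not low", "low")
)

# Split the weights into one vector per class
by_class <- split(toy$weight, toy$lowbirthweight)

# The cutoff must lie between these two values
max_low <- max(by_class[["low"]])
min_not_low <- min(by_class[["not low"]])
print(c(max_low = max_low, min_not_low = min_not_low))

# A single threshold is consistent with the data only if the classes do not overlap
print(max_low < min_not_low)
```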
Pick a pair of numerical and categorical variables from the
ncbirths dataset and come up with a research question
evaluating the relationship between these variables. Formulate the
question in a way that it can be answered using a hypothesis test and/or
a confidence interval.
Research question: Is there a difference in the average weight gained during pregnancy between mothers who smoke and mothers who do not?
#Use t.test function to find p-value
t.test(x = smokers$gained, y = nonsmokers$gained, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: smokers$gained and nonsmokers$gained
## t = 1.2229, df = 150.17, p-value = 0.2233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.126667 4.786411
## sample estimates:
## mean of x mean of y
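## 31.92623 30.09636
Because the p-value (0.2233) is greater than 0.05 and the 95% confidence interval (-1.13, 4.79) contains 0, we fail to reject the null hypothesis. The data do not provide convincing evidence of a difference in average weight gained between smokers and non-smokers.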