In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
## Warning: package 'openintro' was built under R version 3.3.3
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mn_gained <- mean(ncbirths$gained, na.rm = TRUE)
sd_gained <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
# From the table, 973 observations of weight gained are non-missing
sz_gained <- sum(!is.na(ncbirths$gained))
# Calculate t-critical value for 90% confidence
t90_gained <- qt(0.05, df = (sz_gained-1), lower.tail = TRUE)
round(t90_gained, 4)
## [1] -1.6464
The lower-tail critical t-value for a 90% confidence interval with 972 degrees of freedom is -1.6464; its absolute value is used to compute the margin of error.
# Calculate margin of error
ME90_gained <- abs(t90_gained)*sd_gained/sqrt(sz_gained)
round(ME90_gained, 4)
## [1] 0.7517
The margin of error is 0.7517 lbs.
# Boundaries of confidence interval
#The lower bound is
mn_gained - ME90_gained
## [1] 29.57411
#The upper bound is
mn_gained + ME90_gained
## [1] 31.07748
The 90% confidence interval for the average weight gained by North Carolina mothers is (29.57, 31.08) lbs.
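As a cross-check on the manual steps above, the same interval can be obtained from `t.test()`. The sketch below uses a simulated vector (`gained_sim`, a hypothetical stand-in for `ncbirths$gained`, not the real data) so that it runs without the openintro package; the point is only that the two approaches agree.

```r
# Simulated stand-in for ncbirths$gained (hypothetical values, with a few NAs)
set.seed(1)
gained_sim <- c(rnorm(200, mean = 30, sd = 14), rep(NA, 5))

# Manual interval, following the same steps as above
x      <- gained_sim[!is.na(gained_sim)]
n      <- length(x)
t_crit <- qt(0.95, df = n - 1)          # upper 5% cutoff for 90% confidence
me     <- t_crit * sd(x) / sqrt(n)      # margin of error
manual_ci <- c(mean(x) - me, mean(x) + me)

# Built-in interval; t.test() drops missing values itself
builtin_ci <- as.numeric(t.test(gained_sim, conf.level = 0.90)$conf.int)

# Compare the two intervals up to floating-point tolerance
all.equal(manual_ci, builtin_ci)
```

On the real data, the equivalent call would presumably be `t.test(ncbirths$gained, conf.level = 0.90)`.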
#To find the 95% confidence interval, first find the t-score that corresponds to 95%.
t95_gained <- qt(0.025, df = (sz_gained - 1))
#Then calculate the new bounds.
#Lower:
mn_gained - abs(t95_gained)*sd_gained/sqrt(sz_gained)
## [1] 29.42985
#Upper:
mn_gained + abs(t95_gained)*sd_gained/sqrt(sz_gained)
## [1] 31.22174
The 95% confidence interval for the average weight gained for North Carolina mothers is (29.43, 31.22) lbs.
The 95% confidence interval is wider than the 90% confidence interval: to be more confident that the interval captures the true mean weight gain of all North Carolina mothers, we must allow a larger margin of error.
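The widening comes entirely from the critical value, since the standard error does not depend on the confidence level. A small sketch, assuming the 972 degrees of freedom used above (the 99% level is added only for illustration):

```r
df_gained  <- 972                        # degrees of freedom used above (973 - 1)
conf_level <- c(0.90, 0.95, 0.99)

# Two-sided critical value for each confidence level
t_crit <- qt(1 - (1 - conf_level) / 2, df = df_gained)

# Higher confidence level -> larger critical value -> wider interval
data.frame(conf_level, t_crit = round(t_crit, 4))
```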
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\(H_0: \mu_{NC} = \mu_E = 7.7 \text{ lbs}\)
\(H_A: \mu_{NC} \ne 7.7 \text{ lbs}\)
Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
mn_weight <- mean(ncbirths$weight, na.rm = TRUE)
sd_weight <- sd(ncbirths$weight, na.rm = TRUE)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
#There are no missing data from the baby weight column, so the sample size is 1000.
sz_weight <- 1000
# Test statistic
t95_weight <- (mn_weight - 7.7)/(sd_weight/sqrt(sz_weight))
# Two-sided p-value: probability of a test statistic at least this extreme in either tail
2 * pt(abs(t95_weight), df = sz_weight-1, lower.tail = FALSE)
The p-value is the probability of observing a sample mean at least as far from 7.7 lbs as the observed 7.1 lbs, in either direction, if the true population mean is 7.7 lbs. It is smaller than the significance level of 0.05, so we reject the null hypothesis in favor of the alternative hypothesis.
Conclusion
The data suggest that the mean birth weight of babies born in North Carolina differs from 7.7 lbs, the mean birth weight of European babies.
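Because the alternative hypothesis is two-sided, the p-value counts both tails, which amounts to doubling the tail area beyond |t|. The sketch below checks a manual two-sided p-value against `t.test()` on a simulated vector (`weight_sim`, a hypothetical stand-in for `ncbirths$weight`, not the real data).

```r
# Simulated stand-in for ncbirths$weight (hypothetical values)
set.seed(2)
weight_sim <- rnorm(150, mean = 7.1, sd = 1.5)
mu0 <- 7.7                               # null value from the hypotheses

# Manual test statistic and two-sided p-value
n        <- length(weight_sim)
t_stat   <- (mean(weight_sim) - mu0) / (sd(weight_sim) / sqrt(n))
p_manual <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Same test with the built-in function (two-sided by default)
p_builtin <- t.test(weight_sim, mu = mu0)$p.value

# Compare the two p-values up to floating-point tolerance
all.equal(p_manual, p_builtin)
```

On the real data, the equivalent call would presumably be `t.test(ncbirths$weight, mu = 7.7)`.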
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
\(H_0: \mu_{mage} - \mu_{fage} = 0\)
\(H_A: \mu_{mage} - \mu_{fage} \ne 0\)
two-tailed
Test by confidence interval or p-value and decision
t.test(ncbirths$fage, ncbirths$mage)
##
## Welch Two Sample t-test
##
## data: ncbirths$fage and ncbirths$mage
## t = 10.631, df = 1701.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.655048 3.856411
## sample estimates:
## mean of x mean of y
## 30.25573 27.00000
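The p-value is less than \(\alpha\) = 0.05, therefore we reject \(H_0\). The data provide strong evidence that the mean ages of mothers and fathers differ; fathers are on average about 3.3 years older (95% confidence interval: 2.66 to 3.86 years).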
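The output is labelled "Welch Two Sample t-test" and reports a non-integer df because `t.test()` does not assume equal variances by default. The sketch below reproduces the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand on two simulated vectors (`x` and `y`, hypothetical stand-ins for the two age columns).

```r
# Simulated stand-ins for two age columns (hypothetical values)
set.seed(3)
x <- rnorm(120, mean = 30, sd = 7)
y <- rnorm(150, mean = 27, sd = 6)

# Variance of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch test statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_welch <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# var.equal = FALSE is the default, which gives Welch's test
fit <- t.test(x, y)
c(manual_t  = t_welch,  builtin_t  = unname(fit$statistic))
c(manual_df = df_welch, builtin_df = unname(fit$parameter))
```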
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
Write hypotheses
\(H_0: \mu_{weeksS} - \mu_{weeksNS} = 0\)
\(H_A: \mu_{weeksS} - \mu_{weeksNS} \ne 0\)
two-tailed
Test by confidence interval or p-value and decision
#First create two subsets. One for smokers, a second for nonsmokers
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Then use the t.test to find the p-value
t.test(smokers$weeks, nonsmokers$weeks)
##
## Welch Two Sample t-test
##
## data: smokers$weeks and nonsmokers$weeks
## t = 0.519, df = 182.63, p-value = 0.6044
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3519903 0.6032646
## sample estimates:
## mean of x mean of y
## 38.44444 38.31881
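The p-value is larger than the significance level of 0.05, and the 95% confidence interval includes zero. Therefore, we fail to reject the null hypothesis: the data do not provide convincing evidence that mean pregnancy length differs between smokers and nonsmokers.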
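An alternative to building two subsets is the formula interface of `t.test()`, which splits the numerical variable by the levels of the categorical one. The sketch below uses a simulated data frame (`births_sim`, a hypothetical stand-in for the `habit` and `weeks` columns) to show that both approaches give the same test.

```r
# Simulated stand-in for the habit and weeks columns (hypothetical values)
set.seed(4)
births_sim <- data.frame(
  habit = factor(rep(c("nonsmoker", "smoker"), times = c(180, 40))),
  weeks = c(rnorm(180, mean = 38.3, sd = 2.9), rnorm(40, mean = 38.4, sd = 2.5))
)

# Subset approach: smokers first, nonsmokers second
by_subset <- t.test(births_sim$weeks[births_sim$habit == "smoker"],
                    births_sim$weeks[births_sim$habit == "nonsmoker"])

# Formula approach: groups are taken in factor-level order (nonsmoker first),
# so the estimated difference and its interval have the opposite sign
by_formula <- t.test(weeks ~ habit, data = births_sim)

# Same p-value, and the same interval once the sign and order are flipped
all.equal(by_subset$p.value, by_formula$p.value)
all.equal(as.numeric(by_subset$conf.int), -rev(as.numeric(by_formula$conf.int)))
```

The formula form also handles rows where the grouping variable is missing, since those rows belong to neither group.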
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
#First create two subsets. One for younger moms, a second for mature moms
younger <- subset(ncbirths, ncbirths$mature == "younger mom")
mature <- subset(ncbirths, ncbirths$mature == "mature mom")
#Then use the t.test to find the p-value
t.test(younger$gained, mature$gained, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: younger$gained and mature$gained
## t = 1.3765, df = 175.34, p-value = 0.08521
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.3562741 Inf
## sample estimates:
## mean of x mean of y
## 30.56043 28.79070
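The p-value is greater than \(\alpha\) = 0.05, so we fail to reject the null hypothesis. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.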
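With `alternative = "greater"`, the p-value is the upper-tail area beyond the observed t only, which is half the two-sided p-value when t is positive. The sketch below illustrates this on simulated vectors (`younger_sim` and `mature_sim`, hypothetical stand-ins for the two groups).

```r
# Simulated stand-ins for weight gained in each group (hypothetical values)
set.seed(5)
younger_sim <- rnorm(200, mean = 31, sd = 14)
mature_sim  <- rnorm(80,  mean = 29, sd = 13)

one_sided <- t.test(younger_sim, mature_sim, alternative = "greater")
two_sided <- t.test(younger_sim, mature_sim)

t_stat <- unname(one_sided$statistic)
df     <- unname(one_sided$parameter)

# Upper-tail area beyond the observed t
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)

# Relationship to the two-sided p-value depends on the sign of t
p_from_two <- if (t_stat > 0) two_sided$p.value / 2 else 1 - two_sided$p.value / 2

all.equal(one_sided$p.value, p_upper)
all.equal(one_sided$p.value, p_from_two)
```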
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
#Let's create two subsets, one for "low" weight babies and a second for "not low" weight babies.
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
notlow <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
#Generate a summary of each subset
fivenum(low$weight)
## [1] 1.000 3.095 4.560 5.160 5.500
fivenum(notlow$weight)
## [1] 5.56 6.75 7.44 8.13 11.75
The heaviest “low” baby weighed 5.5 lbs, while the lightest “not low” baby weighed 5.56 lbs. Based on those numbers, the cutoff is at least 5.5 lbs and less than 5.56 lbs. The value that makes the most sense to me is 5.5 lbs: babies weighing 5.5 lbs or less are classified as “low”, while babies weighing more than 5.5 lbs are classified as “not low”.
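The same comparison can be made in one step by computing the range of weights within each class. The sketch below uses a simulated data frame (`births_sim`, hypothetical values classified with an assumed 5.5 lb cutoff) rather than the real data, to show how the boundary shows up in the per-class ranges.

```r
# Simulated weights, classified with an assumed cutoff of 5.5 lbs (hypothetical)
set.seed(6)
weight_sim <- round(runif(300, min = 1, max = 11.75), 2)
births_sim <- data.frame(
  weight         = weight_sim,
  lowbirthweight = factor(ifelse(weight_sim <= 5.5, "low", "not low"))
)

# Min (row 1) and max (row 2) of weight within each class
ranges <- sapply(split(births_sim$weight, births_sim$lowbirthweight), range)
ranges

# The cutoff must lie between these two values
heaviest_low     <- ranges[2, "low"]
lightest_not_low <- ranges[1, "not low"]
c(heaviest_low = heaviest_low, lightest_not_low = lightest_not_low)
```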
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Are boys’ birth weights higher than girls’ birth weights at the 0.01 significance level?
Write hypotheses
\(H_0: \mu_{weightB} - \mu_{weightG} = 0\)
\(H_A: \mu_{weightB} - \mu_{weightG} > 0\)
one-tailed
Test by confidence interval or p-value and decision
#First create two subsets. One for boys, another for girls
boys <- subset(ncbirths, ncbirths$gender == "male")
girls <- subset(ncbirths, ncbirths$gender == "female")
#Then use the t.test to find the p-value
t.test(boys$weight, girls$weight, alternative = "greater", conf.level = 0.99)
##
## Welch Two Sample t-test
##
## data: boys$weight and girls$weight
## t = 4.2113, df = 996.45, p-value = 1.384e-05
## alternative hypothesis: true difference in means is greater than 0
## 99 percent confidence interval:
## 0.1780681 Inf
## sample estimates:
## mean of x mean of y
## 7.301509 6.902883
The p-value is less than \(\alpha\) = 0.01, so we reject the null hypothesis. The data provide strong evidence that the mean birth weight of boys is higher than that of girls.