In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
meang <- mean(ncbirths$gained, na.rm = TRUE)
sdg <- sd(ncbirths$gained, na.rm = TRUE)
(ssg <- table(!is.na(ncbirths$gained))[[2]])
## [1] 973
The mean weight gained by mothers during pregnancy is approximately 30.33 pounds.
The standard deviation of the weight gained by mothers during pregnancy is approximately 14.24 pounds.
The sample size (the number of non-missing values of gained) is 973.
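The sample size above is read off the second cell of a table of non-missing flags. A minimal sketch of an alternative count, assuming a made-up vector (`x` and `y` below are hypothetical, not taken from ncbirths): summing the logical vector gives the same number and still works when nothing is missing, a case where the table has only one cell and `[[2]]` would fail.

```r
# Hypothetical vectors for illustration only (not from ncbirths)
x <- c(12, NA, 30, 25, NA, 41)   # two missing values
y <- c(1, 2, 3)                  # no missing values

# Count non-missing observations by summing TRUE values
n_obs <- sum(!is.na(x))
n_complete <- sum(!is.na(y))

c(n_obs = n_obs, n_complete = n_complete)
```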
# Calculate t-critical value for 90% confidence
abs(qt(0.05, df = 972))
## [1] 1.646423
The t-critical value is 1.646423.
# Calculate margin of error
1.646423 * (sdg/sqrt(973))
## [1] 0.7516827
The margin of error is 0.7516827.
# Boundaries of confidence interval
meang - 0.7516827
## [1] 29.57411
meang + 0.7516827
## [1] 31.07748
We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
# Calculate t-critical value for 95% confidence
abs(qt(0.025, df = 972))
## [1] 1.962408
The t-critical value is 1.962408.
# Margin of error
1.962408 * (sdg/sqrt(973))
## [1] 0.8959472
The margin of error is 0.8959472.
# Lower bound
meang - 0.8959472
## [1] 29.42985
# Upper bound
meang + 0.8959472
## [1] 31.22174
We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.
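The two intervals follow the same recipe: a critical value from qt(), a margin of error, then mean plus or minus that margin. A minimal sketch that wraps those steps in one helper, assuming the rounded summary statistics reported in the text (mean 30.33, standard deviation 14.24, n = 973). The function name `t_interval` is hypothetical, and because the inputs are rounded the results can differ slightly from the output shown in the report.

```r
# Hypothetical helper: t-confidence interval from summary statistics
t_interval <- function(xbar, s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  t_star <- qt(1 - alpha / 2, df = n - 1)   # critical value
  moe <- t_star * s / sqrt(n)               # margin of error
  c(lower = xbar - moe, upper = xbar + moe, margin = moe)
}

# Rounded summary statistics for weight gained, as reported in the text
ci90 <- t_interval(xbar = 30.33, s = 14.24, n = 973, conf_level = 0.90)
ci95 <- t_interval(xbar = 30.33, s = 14.24, n = 973, conf_level = 0.95)

rbind(ci90, ci95)
```

Raising the confidence level from 90% to 95% increases the critical value and therefore widens the interval.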
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
\(H_0: \mu = 7.7\)
\(H_A: \mu \neq 7.7\)
# Sample statistics (sample mean, standard deviation, and size)
meanw <- mean(ncbirths$weight)
sdw <- sd(ncbirths$weight)
(ssw <- table(!is.na(ncbirths$weight)))
##
## TRUE
## 1000
The mean birthweight of North Carolina babies is 7.101 lbs.
The standard deviation of the birthweight of North Carolina babies is 1.51 lbs.
The sample size for the birthweight of North Carolina babies is 1000.
# Test statistic
(meanw - 7.7) / (sdw / sqrt(1000))
## [1] -12.55388
The t-score is -12.55388.
# Probability of test statistic by chance
pt(-12.55388, df = 999)*2
## [1] 1.135354e-33
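The p-value is 1.135354e-33.
Decision: The p-value is far below the 0.05 significance level, so we reject the null hypothesis (\(H_0\)). The data provide strong evidence that the average birthweight of NC babies differs from 7.7 lbs.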
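The test statistic and p-value above can be bundled into one function. A minimal sketch, assuming the rounded summary statistics from the text (mean 7.101, standard deviation 1.51, n = 1000); `one_sample_t` is a hypothetical name, and the rounded inputs give a t-score close to, but not exactly, the one reported.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics
one_sample_t <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)                 # standard error of the mean
  t_stat <- (xbar - mu0) / se       # test statistic
  # Two-sided p-value: double the tail area beyond |t|
  p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  c(t = t_stat, df = n - 1, p_value = p_value)
}

# Rounded birthweight statistics from the text, null value 7.7 lbs
birthweight_test <- one_sample_t(xbar = 7.101, s = 1.51, n = 1000, mu0 = 7.7)
birthweight_test
```

Using abs(t_stat) with lower.tail = FALSE gives the same two-sided p-value whether the statistic is negative or positive.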
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \neq 0\)
# Creating a diff variable
ncbirths$diff <- ncbirths$fage - ncbirths$mage
# finding the mean, standard deviation and sample size
meand <- mean(ncbirths$diff, na.rm = TRUE)
sdd <- sd(ncbirths$diff, na.rm = TRUE)
(ssd <- table(!is.na(ncbirths$diff))[[2]])
## [1] 829
The mean of the age difference (father's age minus mother's age) is 2.65 years.
The standard deviation of the age difference is 4.32 years.
The sample size is 829.
# Test Statistics
(meand - 0) / (sdd/sqrt(829))
## [1] 17.6727
The t-score is 17.6727.
# Probability of test statistic by chance
pt(17.6727, df = 828, lower.tail = FALSE)*2
## [1] 1.504649e-59
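The p-value is 1.504649e-59.
Decision: The p-value is far below the 0.05 significance level, so we reject the null hypothesis (\(H_0\)). The data provide strong evidence that the mean age of fathers differs from the mean age of mothers.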
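Because each father is matched with one mother, this is a paired comparison: the test is a one-sample t-test on the differences. A minimal sketch on made-up ages (the vectors `father_age` and `mother_age` are hypothetical, not drawn from ncbirths) showing that the by-hand calculation agrees with t.test() called with paired = TRUE.

```r
# Hypothetical paired ages for illustration only (not from ncbirths)
father_age <- c(31, 28, 40, 35, 26, 33)
mother_age <- c(29, 27, 35, 36, 24, 30)

# By hand: one-sample t-test on the pairwise differences
d <- father_age - mother_age
by_hand_t <- mean(d) / (sd(d) / sqrt(length(d)))
by_hand_p <- 2 * pt(abs(by_hand_t), df = length(d) - 1, lower.tail = FALSE)

# Built-in paired t-test on the same data
builtin <- t.test(father_age, mother_age, paired = TRUE)

c(by_hand_t = by_hand_t, builtin_t = unname(builtin$statistic))
c(by_hand_p = by_hand_p, builtin_p = builtin$p.value)
```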
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
# smokers subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
# non smoker subset
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# Finding the mean, standard deviation and sample size of each
means <- mean(smokers$weeks, na.rm = TRUE)
meannons <- mean(nonsmokers$weeks, na.rm = TRUE)
sds <- sd(smokers$weeks, na.rm = TRUE)
sdnons <- sd(nonsmokers$weeks, na.rm = TRUE)
(sss <- table(!is.na(smokers$weeks)))
##
## TRUE
## 126
(ssnons <- table(!is.na(nonsmokers$weeks))[[2]])
## [1] 872
The mean pregnancy length for smokers is 38.44 weeks.
The standard deviation of pregnancy length for smokers is 2.47 weeks.
The sample size for pregnancy length for smokers is 126.
The mean pregnancy length for nonsmokers is 38.32 weeks.
The standard deviation of pregnancy length for nonsmokers is 2.99 weeks.
The sample size for pregnancy length for nonsmokers is 872.
# standard error
sqrt((sds^2/126) + (sdnons^2/873))
## [1] 0.2420528
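The standard error is 0.2420528. (The calculation above uses 873 for the nonsmoker sample size; with the 872 non-missing values of weeks counted earlier, the result changes only in the fifth decimal place.)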
# test statistics
((means - meannons)-0) / 0.2420528
## [1] 0.5190483
The t-score is 0.5190483.
# p-value
pt(0.5190483, df = 125, lower.tail = FALSE)*2
## [1] 0.6046448
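The p-value is 0.6046448.
Decision: The p-value is greater than the 0.05 significance level, so we fail to reject the null hypothesis. The data do not provide convincing evidence of a difference in mean pregnancy length between smokers and nonsmokers.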
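The test above uses the conservative degrees of freedom, the smaller sample size minus one. A minimal sketch, assuming the rounded summary statistics from the text and a nonsmoker sample size of 872, that also computes the Welch-Satterthwaite degrees of freedom for comparison. The helper name `two_sample_t` is hypothetical, and this is one way to organize the calculation rather than the method used in the report.

```r
# Hypothetical helper: two-sided two-sample t-test from summary statistics
two_sample_t <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  se <- sqrt(v1 + v2)                       # unpooled standard error
  t_stat <- (m1 - m2) / se
  df_conservative <- min(n1, n2) - 1        # approach used in the text
  # Welch-Satterthwaite approximation to the degrees of freedom
  df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_conservative <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)
  p_welch <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)
  c(se = se, t = t_stat,
    df_conservative = df_conservative, p_conservative = p_conservative,
    df_welch = df_welch, p_welch = p_welch)
}

# Rounded pregnancy-length statistics: smokers first, then nonsmokers
weeks_test <- two_sample_t(m1 = 38.44, s1 = 2.47, n1 = 126,
                           m2 = 38.32, s2 = 2.99, n2 = 872)
weeks_test
```

The Welch degrees of freedom are never smaller than the conservative value, so the conservative p-value is at least as large as the Welch p-value; here both lead to the same decision.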
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
Write hypotheses
\(H_0: \mu_1 = \mu_2\)
\(H_A: \mu_1 > \mu_2\)
Test by confidence interval or p-value and decision
# younger mothers subset
youngermoms <- subset(ncbirths, ncbirths$mature == "younger mom")
# mature mothers subset
maturemoms <- subset(ncbirths, ncbirths$mature == "mature mom")
# Finding the mean, standard deviation and sample size
meany <- mean(youngermoms$gained, na.rm = TRUE)
meanm <- mean(maturemoms$gained, na.rm = TRUE)
sdy <- sd(youngermoms$gained, na.rm = TRUE)
sdm <- sd(maturemoms$gained, na.rm = TRUE)
(ssy <- table(!is.na(youngermoms$gained)))
##
## FALSE TRUE
## 23 844
(ssm <- table(!is.na(maturemoms$gained)))
##
## FALSE TRUE
## 4 129
The mean weight gained by younger mothers is 30.56 pounds.
The standard deviation of the weight gained by younger mothers is 14.35 pounds.
The mean weight gained by mature mothers is 28.79 pounds.
The standard deviation of the weight gained by mature mothers is 13.48 pounds.
The sample size for younger mothers is 844.
The sample size for mature mothers is 129.
# Standard error
sqrt((sdy^2/844) + (sdm^2/129))
## [1] 1.285689
The standard error is 1.285689.
# test statistics
((meany - meanm) - 0) / 1.285689
## [1] 1.376483
The t-score is 1.376483.
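# p-value (one-sided, upper tail, because H_A is mu_1 > mu_2)
round(pt(1.376483, df = 128, lower.tail = FALSE), 4)
## [1] 0.0855
The one-sided p-value is approximately 0.0855.
Decision: The p-value is greater than the 0.05 significance level, so we fail to reject the null hypothesis. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.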
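The alternative hypothesis here is one-sided (\(\mu_1 > \mu_2\)), so the relevant tail area is the upper tail only, without doubling. A minimal sketch contrasting the one-sided and two-sided p-values for the t-score and degrees of freedom reported above; the variable names are illustrative.

```r
# t-score and conservative degrees of freedom reported in the text
t_stat <- 1.376483
df <- 128

# One-sided p-value for H_A: mu_1 > mu_2 (upper tail only)
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided p-value for H_A: mu_1 != mu_2 (both tails)
p_two_sided <- 2 * p_one_sided

c(one_sided = p_one_sided, two_sided = p_two_sided)
```

Both values exceed 0.05, so the decision to fail to reject \(H_0\) is the same under either alternative.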
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# creating subsets
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
# Analyzing birthweight
mean(low$weight)
## [1] 4.034775
mean(not_low$weight)
## [1] 7.483847
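The mean birthweight is about 4.03 lbs in the low group and about 7.48 lbs in the not low group, so the two groups are clearly separated on average. The means alone do not locate the cutoff, however: the cutoff must lie between the heaviest baby classified as low and the lightest baby classified as not low, that is, between max(low$weight) and min(not_low$weight).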
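One way to bracket the cutoff is to compare the largest weight in the low group with the smallest weight in the not low group. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical and are not the ncbirths data); the same two calls applied to the low and not_low subsets would bracket the cutoff in the real dataset.

```r
# Hypothetical toy data for illustration only (not the ncbirths values)
toy <- data.frame(
  weight = c(3.2, 4.9, 5.4, 5.6, 6.8, 7.5, 8.1),
  lowbirthweight = c("low", "low", "low",
                     "not low", "not low", "not low", "not low")
)

# Heaviest baby labelled "low" and lightest baby labelled "not low"
heaviest_low <- max(toy$weight[toy$lowbirthweight == "low"], na.rm = TRUE)
lightest_not_low <- min(toy$weight[toy$lowbirthweight == "not low"], na.rm = TRUE)

# The classification cutoff lies between these two values
c(heaviest_low = heaviest_low, lightest_not_low = lightest_not_low)
```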
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Is there a difference in the number of visits for younger mothers and mature mothers?
Write hypotheses
\(H_0: \mu_{mature} - \mu_{younger} = 0\)
\(H_A: \mu_{mature} - \mu_{younger} \neq 0\)
Test by confidence interval or p-value and decision
# Younger mothers subset
youngermothers <- subset(ncbirths, ncbirths$mature == "younger mom")
# Mature mothers subset
maturemothers <- subset(ncbirths, ncbirths$mature == "mature mom")
# Finding the mean, standard deviation and sample size
(meanym <- mean(youngermothers$visits, na.rm = TRUE))
## [1] 12.02791
(meanmm <- mean(maturemothers$visits, na.rm = TRUE))
## [1] 12.61069
(sdym <- sd(youngermothers$visits, na.rm = TRUE))
## [1] 3.883239
(sdmm <- sd(maturemothers$visits, na.rm = TRUE))
## [1] 4.379274
table(!is.na(youngermothers$visits))
##
## FALSE TRUE
## 7 860
table(!is.na(maturemothers$visits))
##
## FALSE TRUE
## 2 131
# Standard Error
sqrt((sdmm^2/131)+(sdym^2/860))
## [1] 0.4048847
# test statistics
((meanmm - meanym)-0) / 0.4048847
## [1] 1.439373
# Probability
pt(1.439373, df = 130, lower.tail = FALSE)*2
## [1] 0.1524484
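The p-value is 0.1524484.
Decision: The p-value is greater than the 0.05 significance level, so we fail to reject \(H_0\). The data do not provide convincing evidence of a difference in the mean number of visits between younger and mature mothers.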
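The same question can be answered with a confidence interval for the difference in means. A minimal sketch, assuming the group means, standard error, and conservative degrees of freedom (130) reported above; the variable names are illustrative.

```r
# Values reported in the text: mature mean minus younger mean
diff_visits <- 12.61069 - 12.02791
se_visits <- 0.4048847

# Critical value for 95% confidence with the conservative df = 130
t_star <- qt(0.975, df = 130)

# 95% confidence interval for the difference in mean visits
ci_visits <- c(lower = diff_visits - t_star * se_visits,
               upper = diff_visits + t_star * se_visits)
ci_visits
```

The interval contains 0, which agrees with failing to reject \(H_0\) at the 0.05 level.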