In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Determine NAs in ncbirths$gained
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
# Store mean, standard deviation, and sample size of dataset
mngain <- mean(ncbirths$gained, na.rm = TRUE)
sdgain <- sd(ncbirths$gained, na.rm = TRUE)
lengthgain <- sum(!is.na(ncbirths$gained))
dfgain <- lengthgain - 1
# Print mean, standard deviation, sample size, and degrees of freedom
mngain
## [1] 30.3258
sdgain
## [1] 14.2413
lengthgain
## [1] 973
dfgain
## [1] 972
# Calculate t-critical value for 90% confidence
tcrit <- abs(qt(0.05, dfgain))
The t-critical value is 1.65
# Calculate margin of error
megain <- tcrit*(sdgain/sqrt(lengthgain))
The margin of error is 0.75
# Boundaries of confidence interval
mngain - megain
## [1] 29.57411
mngain + megain
## [1] 31.07748
We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
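The steps above (critical value, margin of error, interval bounds) can be collected into one function. The following is a minimal sketch, not part of the original analysis: `t_ci` and its argument names are hypothetical, and the inputs are the rounded summary statistics printed above, so the bounds agree with the reported interval only up to rounding.

```r
# Hypothetical helper: t-based confidence interval from summary statistics
t_ci <- function(xbar, s, n, conf = 0.90) {
  alpha <- 1 - conf
  tcrit <- qt(1 - alpha / 2, df = n - 1)  # upper-tail critical value
  me <- tcrit * s / sqrt(n)               # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Rounded summary statistics printed above for ncbirths$gained
ci90 <- t_ci(xbar = 30.3258, s = 14.2413, n = 973, conf = 0.90)
round(ci90, 2)
```

Passing `conf = 0.95` to the same function gives the wider interval used in the next part.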
# Calculate t-critical value for 95% confidence
tcon <- abs(qt(.025, dfgain))
The t-critical value is 1.96
# Margin of Error
moe <- tcon*(sdgain/sqrt(lengthgain))
The margin of error is 0.90
# Boundaries of confidence interval
mngain - moe
## [1] 29.42985
mngain + moe
## [1] 31.22174
We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.
# 90% confidence interval difference
31.08 - 29.57
## [1] 1.51
# 95% confidence interval difference
31.22 - 29.43
## [1] 1.79
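The 95% confidence interval is wider than the 90% interval (about 1.8 versus 1.5 pounds). To be more confident that the interval captures \(\mu\), we need a larger t-critical value (1.96 versus 1.65), which increases the margin of error.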
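The relationship between confidence level and interval width can be checked directly. This sketch reuses the rounded summary statistics printed earlier; the variable names are illustrative, and the 99% level is an extra case added here for comparison that does not appear in the original analysis.

```r
# Rounded summary statistics printed earlier for ncbirths$gained
xbar <- 30.3258
s <- 14.2413
n <- 973

conf_levels <- c(0.90, 0.95, 0.99)                   # 0.99 added for illustration
tcrit <- qt(1 - (1 - conf_levels) / 2, df = n - 1)   # critical value per level
width <- 2 * tcrit * s / sqrt(n)                     # full width = 2 * margin of error

data.frame(conf = conf_levels, tcrit = round(tcrit, 3), width = round(width, 2))
```

The standard error is the same in every row; only the critical value changes, so the width grows with the confidence level.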
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\(H_0: \mu = 7.7\)
\(H_A: \mu \neq 7.7\)
Test by p-value and decision
# Determine NAs in ncbirths$weight
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
# Sample statistics (sample mean, standard deviation, and size)
mnbweight <- mean(ncbirths$weight)
sdbweight <- sd(ncbirths$weight)
lengthbweight <- sum(!is.na(ncbirths$weight))
dfbweight <- lengthbweight - 1
# Test statistic
teststat <- (mnbweight - 7.7) / (sdbweight/sqrt(lengthbweight))
teststat
## [1] -12.55388
# Probability of test statistic by chance
pt(teststat, dfbweight)*2
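## [1] 1.135415e-33
The p-value is far below 0.05, so we reject the null hypothesis. The data provide strong evidence that the average birthweight of NC babies is different from 7.7 lbs.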
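A hand calculation like the one above can be cross-checked against R's built-in `t.test()`. The sketch below uses simulated data as a stand-in for `ncbirths$weight` (the values, seed, and sample size are assumptions for illustration only), so the numbers will not match the output above. The point is that the two approaches agree.

```r
set.seed(1)
# Hypothetical data standing in for ncbirths$weight
weight <- rnorm(200, mean = 7.4, sd = 1.5)
mu0 <- 7.7  # null value from the hypotheses

# Manual one-sample t-test
n <- length(weight)
tstat <- (mean(weight) - mu0) / (sd(weight) / sqrt(n))
pval <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)  # two-sided

# Built-in equivalent (two-sided by default)
check <- t.test(weight, mu = mu0)

c(manual = tstat, t.test = unname(check$statistic))
c(manual = pval, t.test = check$p.value)
```

Using `abs(tstat)` with `lower.tail = FALSE` gives the correct two-sided p-value whether the test statistic is positive or negative.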
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \neq 0\)
Here \(\mu_{diff}\) is the mean of the paired differences (father's age minus mother's age).
Test by confidence interval or p-value and decision
# Calculate age difference (father minus mother), then store its mean, standard deviation, and sample size
ncbirths$diff <- ncbirths$fage - ncbirths$mage
table(is.na(ncbirths$diff))
##
## FALSE TRUE
## 829 171
mndiff <- mean(ncbirths$diff, na.rm = TRUE)
sddiff <- sd(ncbirths$diff, na.rm = TRUE)
lengthdiff <- sum(!is.na(ncbirths$diff))
dfdiff <- lengthdiff - 1
# Test stat
statdiff <- (mndiff-0) / (sddiff/sqrt(lengthdiff))
statdiff
## [1] 17.6727
# Probability Test
pt(statdiff, dfdiff, lower.tail = FALSE)*2
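## [1] 1.504608e-59
The p-value is far below 0.05, so we reject the null hypothesis. There is a significant difference between the mean ages of mothers and fathers; the positive test statistic (father minus mother) indicates that fathers are older on average.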
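Because each father is matched with one mother, this is a paired test: a one-sample t-test on the differences. The sketch below illustrates that equivalence with simulated ages standing in for `ncbirths$fage` and `ncbirths$mage` (all values are assumptions for illustration), including a few missing fathers' ages as in the real data.

```r
set.seed(2)
# Hypothetical ages standing in for ncbirths$mage and ncbirths$fage
mage <- round(rnorm(50, mean = 27, sd = 6))
fage <- mage + round(rnorm(50, mean = 3, sd = 4))
fage[c(5, 17, 30)] <- NA  # some fathers' ages are missing

# One-sample t-test on the differences
d <- fage - mage          # NA wherever either age is missing
n <- sum(!is.na(d))       # number of complete pairs
tstat <- mean(d, na.rm = TRUE) / (sd(d, na.rm = TRUE) / sqrt(n))
pval <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)

# Built-in paired test; incomplete pairs are dropped
paired <- t.test(fage, mage, paired = TRUE)

c(manual = tstat, t.test = unname(paired$statistic))
```

Counting complete pairs with `sum(!is.na(d))` keeps the sample size in step with the data instead of relying on a typed-in number.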
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Here \(\mu_1\) is the mean length of pregnancy for smokers and \(\mu_2\) is the mean for nonsmokers.
Test by confidence interval or p-value and decision
# Create subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
# Check for NAs, then store the mean, standard deviation, and sample size of pregnancy length by smoking habit
table(is.na(smokers$weeks))
##
## FALSE
## 126
table(is.na(nonsmokers$weeks))
##
## FALSE TRUE
## 872 1
mnsmokersW <- mean(smokers$weeks, na.rm = TRUE)
mnnonsmokersW <- mean(nonsmokers$weeks, na.rm = TRUE)
sdsmokersW <- sd(smokers$weeks, na.rm = TRUE)
sdnonsmokersW <- sd(nonsmokers$weeks, na.rm = TRUE)
lengthsmokersW <- sum(!is.na(smokers$weeks))
lengthnonsmokersW <- sum(!is.na(nonsmokers$weeks))
dfsmokersW <- lengthsmokersW - 1
dfnonsmokersW <- lengthnonsmokersW - 1
# mean of difference in weeks
mndiffW <- mnsmokersW - mnnonsmokersW
mndiffW
## [1] 0.1256371
# Standard Error of Weeks
seW <- sqrt((sdsmokersW^2 /lengthsmokersW) + (sdnonsmokersW^2 / lengthnonsmokersW))
# Test Stat
teststatW <- (mndiffW - 0) / seW
teststatW
## [1] 0.5189962
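# p-value (two-sided); this uses dfnonsmokersW = 871, while the conservative choice min(n1, n2) - 1 = 125 gives a slightly larger p-value
pt(teststatW, dfnonsmokersW, lower.tail = FALSE)*2
## [1] 0.6038953
The p-value (0.604) is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide sufficient evidence of a difference in mean length of pregnancy between smokers and nonsmokers.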
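The choice of degrees of freedom in a two-sample test can be explored with a short sketch. It compares the conservative choice, min(n1, n2) - 1, with the Welch approximation that `t.test()` uses by default. The data are simulated stand-ins for the two subsets (group means, standard deviations, and the seed are assumptions for illustration), so the printed values will differ from those above.

```r
set.seed(3)
# Hypothetical pregnancy lengths (weeks) standing in for the two subsets
smoker_wk <- rnorm(126, mean = 38.4, sd = 2.8)
nonsmoker_wk <- rnorm(872, mean = 38.3, sd = 2.9)

n1 <- length(smoker_wk)
n2 <- length(nonsmoker_wk)
se <- sqrt(var(smoker_wk) / n1 + var(nonsmoker_wk) / n2)  # unpooled standard error
tstat <- (mean(smoker_wk) - mean(nonsmoker_wk)) / se

df_cons <- min(n1, n2) - 1                                # conservative df
p_cons <- 2 * pt(abs(tstat), df_cons, lower.tail = FALSE)

welch <- t.test(smoker_wk, nonsmoker_wk)                  # Welch df by default

c(df_conservative = df_cons, df_welch = unname(welch$parameter))
c(p_conservative = p_cons, p_welch = welch$p.value)
```

The test statistic is identical in both approaches. The conservative df is never larger than the Welch df, so its p-value is never smaller.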
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
# create subsets
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
dfymom <- 128
# Mean difference in weights
mndiffM <- abs(mean(maturemom$gained, na.rm = TRUE) - mean(youngermom$gained, na.rm = TRUE))
# Sample Size
table(is.na(maturemom$gained))
##
## FALSE TRUE
## 129 4
table(is.na(youngermom$gained))
##
## FALSE TRUE
## 844 23
# Standard Error
seM <- sqrt((sd(maturemom$gained, na.rm = TRUE)^2/133)+(sd(youngermom$gained, na.rm = TRUE)^2/867))
# Test Stat
teststatM <- (mndiffM-0)/seM
# p-value
pt(teststatM, dfymom, lower.tail=FALSE)
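## [1] 0.08237283
Taking \(\mu_1\) as the mean weight gained by younger mothers and \(\mu_2\) as the mean for mature mothers, the hypotheses are \(H_0: \mu_1 - \mu_2 = 0\) and \(H_A: \mu_1 - \mu_2 > 0\). The p-value (0.082) is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide sufficient evidence that younger mothers gain more weight on average than mature mothers.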
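Two details matter in a one-sided test like this one: the direction of the difference and the sample sizes. The calculation above takes the absolute value of the difference and divides by the full subset sizes (133 and 867) rather than the non-missing counts (129 and 844). The sketch below shows one way to keep the sign and count only observed values. It runs on simulated data (all values are assumptions for illustration), so it does not reproduce the p-value above.

```r
set.seed(4)
# Hypothetical weight gains standing in for the two subsets, with some NAs
younger <- c(rnorm(60, mean = 31, sd = 14), rep(NA, 3))
mature <- c(rnorm(20, mean = 29, sd = 13), NA)

# Sample sizes count only non-missing observations
n_y <- sum(!is.na(younger))
n_m <- sum(!is.na(mature))

# Signed difference in the direction of H_A: mu_younger - mu_mature > 0
diff_ym <- mean(younger, na.rm = TRUE) - mean(mature, na.rm = TRUE)
se <- sqrt(var(younger, na.rm = TRUE) / n_y + var(mature, na.rm = TRUE) / n_m)
tstat <- diff_ym / se

# Upper-tail p-value with the conservative df
pval <- pt(tstat, df = min(n_y, n_m) - 1, lower.tail = FALSE)
pval
```

If the sample difference points the other way (younger mothers gaining less), the signed statistic is negative and the upper-tail p-value exceeds 0.5, which an absolute value would hide.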
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
# create subsets
smallBaby <- subset(ncbirths, ncbirths$lowbirthweight == "low")
bigBaby <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
# summary of weights of babies
summary(smallBaby$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.095 4.560 4.035 5.160 5.500
summary(bigBaby$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.560 6.750 7.440 7.484 8.130 11.750
table(ncbirths$lowbirthweight)
##
## low not low
## 111 889
table(smallBaby$premie)
##
## full term premie
## 30 80
table(bigBaby$premie)
##
## full term premie
## 816 72
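I created two subsets of the data by birthweight classification, one for “low” and one for “not low”, and used the summary function to find the lowest and highest weight in each subset. The heaviest baby classified as “low” weighed 5.50 pounds and the lightest baby classified as “not low” weighed 5.56 pounds, so the cutoff lies between 5.50 and 5.56 pounds: babies at or below 5.50 pounds were classified as “low” and babies at or above 5.56 pounds as “not low”.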
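The same boundary can be found by taking the maximum of one class and the minimum of the other. The sketch below uses a small hypothetical data frame in place of `ncbirths`, built from the extreme and median values in the summaries above. It also checks one plausible explanation for the gap: the common clinical definition of low birthweight as under 2,500 g. That definition is an assumption here, not something stated in the dataset output above.

```r
# Hypothetical rows standing in for ncbirths (weights in pounds)
births <- data.frame(
  weight = c(1.00, 4.56, 5.50, 5.56, 7.44, 11.75),
  lowbirthweight = c("low", "low", "low", "not low", "not low", "not low")
)

# Heaviest "low" baby and lightest "not low" baby bracket the cutoff
max_low <- max(births$weight[births$lowbirthweight == "low"])
min_not_low <- min(births$weight[births$lowbirthweight == "not low"])
c(max_low = max_low, min_not_low = min_not_low)

# Assumed clinical cutoff of 2,500 g, converted to pounds (1 lb = 453.59237 g)
cutoff_lb <- 2500 / 453.59237
cutoff_lb > max_low && cutoff_lb < min_not_low
```

A cutoff of 2,500 g is about 5.51 pounds, which falls inside the observed gap, so it is consistent with the data without being proven by it.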
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Is there a significant difference in the length of pregnancy between younger mothers and mature mothers?
Write hypotheses
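\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Here \(\mu_1\) is the mean length of pregnancy for mature mothers and \(\mu_2\) is the mean for younger mothers.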
Test by confidence interval or p-value and decision
# Create subsets and store mean and standard deviation
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
mnweeksYM <- mean(youngermom$weeks, na.rm = TRUE)
mnweeksMM <- mean(maturemom$weeks, na.rm = TRUE)
sdweeksYM <- sd(youngermom$weeks, na.rm = TRUE)
sdweeksMM <- sd(maturemom$weeks, na.rm = TRUE)
mndiffweeks <- (mnweeksMM - mnweeksYM)
# Determine if any NAs
table(is.na(maturemom$weeks))
##
## FALSE TRUE
## 132 1
table(is.na(youngermom$weeks))
##
## FALSE TRUE
## 866 1
# Standard error
seMOMweeks <- sqrt((sdweeksMM^2/132)+(sdweeksYM^2/866))
seMOMweeks
## [1] 0.2967804
# Test Stat
weeksSTAT <- abs((mndiffweeks- 0) / seMOMweeks)
weeksSTAT
## [1] 1.211299
# Find p-value
pt(weeksSTAT, df = 131, lower.tail = FALSE)*2
## [1] 0.2279614
The p-value is 0.228, which is greater than alpha (0.05), so we fail to reject the null hypothesis. The data do not provide sufficient evidence of a difference in pregnancy length between younger and mature mothers.
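The decision rule used throughout this project compares the p-value with alpha: reject the null hypothesis only when the p-value is smaller. A minimal sketch of that rule follows; `decide` is a hypothetical helper written for illustration, applied to two p-values reported above.

```r
# Hypothetical helper: turn a p-value into a decision at level alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

decide(0.2279614)     # p-value from the pregnancy-length test above
decide(1.135415e-33)  # p-value from the birthweight test above
```

Writing the comparison out this way makes the direction of the inequality explicit, which is the easiest part of the procedure to get backwards.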