In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mean.gained <- mean(ncbirths$gained, na.rm = TRUE)
sd.gained <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
sample.size <- 973
# Calculate t-critical value for 90% confidence
abs(qt(.05, df = 972))
## [1] 1.646423
# Calculate margin of error
1.646423 * 14.24129662/sqrt(973)
## [1] 0.7516827
# Boundaries of confidence interval
#Lower bound
30.3258 - 0.7516827
## [1] 29.57412
#Upper Bound
30.3258 + 0.7516827
## [1] 31.07748
We are 90% confident that the average weight gained by North Carolina mothers is between 29.57 pounds and 31.08 pounds.
#Calculate t-critical value for 95% confidence
abs(qt(.025, df = 972))
## [1] 1.962408
#Boundaries of confidence interval
#Lower bound
30.3258 - 1.962408*14.24129662/sqrt(973)
## [1] 29.42985
#Upper bound
30.3258 + 1.962408*14.24129662/sqrt(973)
## [1] 31.22175
We are 95% confident that the average weight gained by North Carolina mothers is between 29.43 pounds and 31.22 pounds.
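The two intervals above follow the same recipe, so the steps can be wrapped in one small function. This is only a sketch: `t_ci()` is a hypothetical helper (not part of openintro or base R), and the inputs are the rounded summary statistics used in the calculations above.

```r
# Sketch: t-confidence interval from summary statistics.
# t_ci() is a hypothetical helper written for illustration.
t_ci <- function(xbar, s, n, level = 0.95) {
  alpha  <- 1 - level
  t_star <- qt(1 - alpha / 2, df = n - 1)  # two-sided critical value
  me     <- t_star * s / sqrt(n)           # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Summary statistics for ncbirths$gained as used above
ci90 <- t_ci(xbar = 30.3258, s = 14.24129662, n = 973, level = 0.90)
ci95 <- t_ci(xbar = 30.3258, s = 14.24129662, n = 973, level = 0.95)
round(ci90, 2)
round(ci95, 2)
```

Raising the confidence level only changes the critical value, which is why the 95% interval is wider than the 90% interval.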
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
Write hypotheses
\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\)
Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
mean.weight <- mean(ncbirths$weight, na.rm = TRUE)
sd.weight <- sd(ncbirths$weight, na.rm = TRUE)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
# Test statistic
(7.101 - 7.7)/(1.50886/sqrt(1000))
## [1] -12.55388
The t-score is -12.55388
# Probability of test statistic by chance
pt(-12.55388, df = 999)*2
## [1] 1.135354e-33
Decision: The p-value is less than 0.05, so we reject the null hypothesis (H0).
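The test statistic and p-value above can be reproduced from the summary statistics alone. The following is a minimal sketch; `one_sample_t()` is a hypothetical helper, and the inputs are the rounded values typed into the calculation above.

```r
# Sketch: two-sided one-sample t-test from summary statistics.
# one_sample_t() is a hypothetical helper written for illustration.
one_sample_t <- function(xbar, mu0, s, n) {
  t_stat <- (xbar - mu0) / (s / sqrt(n))      # standardized distance from mu0
  p_two  <- 2 * pt(-abs(t_stat), df = n - 1)  # area in both tails
  list(t = t_stat, df = n - 1, p.value = p_two)
}

res <- one_sample_t(xbar = 7.101, mu0 = 7.7, s = 1.50886, n = 1000)
res$t
res$p.value < 0.05  # TRUE means reject H0 at the 0.05 level
```

Using `-abs(t_stat)` keeps the p-value calculation correct whether the sample mean falls below or above the hypothesized value.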
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
Write hypotheses
\(H_0: \mu_{diff} = 0\) \(H_A: \mu_{diff} \neq 0\), where \(\mu_{diff}\) is the mean of father's age minus mother's age.
Test by confidence interval or p-value and decision
#Create a column of differences
ncbirths$diff <- ncbirths$fage - ncbirths$mage
#Sample statistics (sample mean, standard deviation, and size)
mean(ncbirths$diff, na.rm = TRUE)
## [1] 2.652593
sd(ncbirths$diff, na.rm = TRUE)
## [1] 4.321604
table(is.na(ncbirths$diff))
##
## FALSE TRUE
## 829 171
# Calculating the t-score
(mean(ncbirths$diff,na.rm = TRUE)-0)/(sd(ncbirths$diff,na.rm = TRUE)/sqrt(829))
## [1] 17.6727
#Probability of getting the test statistic
pt(17.6727, df = 828, lower.tail = FALSE)*2
## [1] 1.504649e-59
Decision: The p-value is less than 0.05, so we reject the null hypothesis.
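Because each father is matched with one mother, this is a paired test, and a paired test is the same as a one-sample t-test on the differences. The sketch below checks that equivalence against `t.test()`. The ages are made-up toy values, not rows from ncbirths.

```r
# Toy data only: these ages are invented for illustration.
fage_toy <- c(30, 35, 28, 40, 33, 27)
mage_toy <- c(28, 31, 29, 36, 30, 27)

# Manual route: one-sample t-test on the differences
d        <- fage_toy - mage_toy
n        <- sum(!is.na(d))  # count only complete pairs
t_manual <- (mean(d) - 0) / (sd(d) / sqrt(n))
p_manual <- 2 * pt(abs(t_manual), df = n - 1, lower.tail = FALSE)

# Built-in route: paired t-test on the two columns
paired <- t.test(fage_toy, mage_toy, paired = TRUE)

c(manual = t_manual, t.test = unname(paired$statistic))
c(manual = p_manual, t.test = paired$p.value)
```

Counting the non-missing differences with `sum(!is.na(d))` avoids typing the sample size by hand.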
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
Write hypotheses
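\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean length of pregnancy for smokers and \(\mu_2\) is the mean for non-smokers.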
Test by confidence interval or p-value and decision
#This stores smokers and nonsmokers subsets
smoker <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Finds the mean for the smokers and non smokers
mean.smoker <- mean(smoker$weeks, na.rm = TRUE)
mean.nonsmokers <- mean(nonsmokers$weeks, na.rm = TRUE)
#finds the standard deviation for smokers and non smokers.
sd.smoker <- sd(smoker$weeks, na.rm = TRUE)
sd.nonsmokers <- sd(nonsmokers$weeks, na.rm = TRUE)
summary(ncbirths$habit)
## nonsmoker smoker NA's
## 873 126 1
#Finds the standard error
SE <- sqrt((sd.smoker^2/126)+(sd.nonsmokers^2/873))
#Finds test statistic
(((mean.smoker)-(mean.nonsmokers))-0)/SE
## [1] 0.5190483
#Probability of getting the test statistic
pt(0.5190483, df = 125, lower.tail = FALSE)*2
## [1] 0.6046448
Decision: The p-value is above 0.05 so we fail to reject the null hypothesis.
Conclusion
There is not sufficient evidence to suggest a significant difference in length of pregnancy between smokers and non-smokers.
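The calculation above takes the group sizes from the `habit` counts, which assumes no `weeks` values are missing within either group. A sketch that counts the non-missing values directly is shown below. `two_sample_t()` is a hypothetical helper and the vectors are made-up toy values. It reports both the conservative degrees of freedom used above, min(n1, n2) - 1, and the Welch-Satterthwaite degrees of freedom that `t.test()` uses by default.

```r
# Sketch: two-sample t-test with unequal variances.
# two_sample_t() is a hypothetical helper written for illustration.
two_sample_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  n1 <- length(x)
  n2 <- length(y)
  v1 <- var(x) / n1  # squared standard error of each mean
  v2 <- var(y) / n2
  t_stat   <- (mean(x) - mean(y)) / sqrt(v1 + v2)
  df_cons  <- min(n1, n2) - 1
  df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  list(t = t_stat,
       df_conservative = df_cons,
       df_welch = df_welch,
       p_conservative = 2 * pt(abs(t_stat), df_cons, lower.tail = FALSE),
       p_welch = 2 * pt(abs(t_stat), df_welch, lower.tail = FALSE))
}

# Toy data only: invented pregnancy lengths in weeks, with missing values
weeks_smoker_toy    <- c(38, 39, NA, 37, 40, 36)
weeks_nonsmoker_toy <- c(39, 40, 38, 41, NA, 39, 37, 40)

res <- two_sample_t(weeks_smoker_toy, weeks_nonsmoker_toy)
tt  <- t.test(weeks_smoker_toy, weeks_nonsmoker_toy)  # Welch test by default
c(manual = res$t, t.test = unname(tt$statistic))
```

The conservative degrees of freedom never exceed the Welch value, so the conservative p-value is never smaller than the Welch p-value.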
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
Write hypotheses
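\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 > 0\), where \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers.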
Test by confidence interval or p-value and decision
#Young mother and mature mothers subset.
y.mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
m.mothers <- subset(ncbirths, ncbirths$mature == "mature mom")
#Sample statistics (sample mean, standard deviation, and size)
mean.y.mothers <- mean(y.mothers$gained, na.rm = TRUE)
mean.m.mothers <- mean(m.mothers$gained, na.rm = TRUE)
sd.y.mothers <- sd(y.mothers$gained, na.rm = TRUE)
sd.m.mothers <- sd(m.mothers$gained, na.rm = TRUE)
table(is.na(y.mothers$gained))
##
## FALSE TRUE
## 844 23
table(is.na(m.mothers$gained))
##
## FALSE TRUE
## 129 4
#Difference in sample means (younger minus mature)
agediff.gained <- (mean.y.mothers-mean.m.mothers)
#Finds the standard error
SE.age.diff <- sqrt((sd.y.mothers^2/844)+(sd.m.mothers^2/129))
#test statistic
(1.7697 - 0)/1.2857
## [1] 1.376449
(agediff.gained-0)/SE.age.diff
## [1] 1.376483
#P-value
pt(1.376486, df = 128, lower.tail = FALSE)
## [1] 0.08553716
Decision: The p-value is above 0.05, so we fail to reject H0.
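Because the question asks whether younger mothers gain more weight, the p-value above is the area in the upper tail only and is not doubled. The sketch below contrasts the one-sided and two-sided p-values for the test statistic and degrees of freedom reported above. The variable names are chosen here for illustration.

```r
# Test statistic and conservative degrees of freedom reported above
t_stat <- 1.376483
df     <- 128

# One-sided alternative (younger > mature): upper-tail area only
p_greater <- pt(t_stat, df = df, lower.tail = FALSE)

# The opposite one-sided alternative would use the lower tail
p_less <- pt(t_stat, df = df)

# A two-sided alternative doubles the tail area beyond |t|
p_two <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(greater = p_greater, less = p_less, two.sided = p_two)
```

The direction of the alternative hypothesis, not the data, decides which tail area to report.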
Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)
#table of low birthweight variable
table(ncbirths$lowbirthweight)
##
## low not low
## 111 889
#Subsets of low and not low birthweight
low<-subset(ncbirths,ncbirths$lowbirthweight=="low")
notlow<-subset(ncbirths,ncbirths$lowbirthweight=="not low")
#Fivenumber summary
fivenum(low$weight)
## [1] 1.000 3.095 4.560 5.160 5.500
fivenum(notlow$weight)
## [1] 5.56 6.75 7.44 8.13 11.75
The weight cutoff for babies being classified as low is 5.50 lbs. To find it, I created subsets of the low and not low birthweight babies and computed the five-number summary for each, which gives the minimum, first quartile, median, third quartile, and maximum of each subset. The maximum weight in the “low” group is 5.50 lbs and the minimum weight in the “not low” group is 5.56 lbs, so the two groups do not overlap: every baby weighing 5.50 lbs or less is classified as low. The true cutoff may lie anywhere between 5.50 and 5.56 lbs; the data contain no babies in that range, so it cannot be pinned down more precisely.
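The same reasoning can be sketched more directly by comparing the heaviest "low" baby with the lightest "not low" baby. The data frame below is a made-up toy example whose boundary values were chosen to mirror the five-number summaries above. It is not a sample from ncbirths.

```r
# Toy data only: invented weights in pounds
births_toy <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.9, 6.8),
  lowbirthweight = c("low", "low", "not low", "not low", "low", "not low")
)

is_low <- births_toy$lowbirthweight == "low"
heaviest_low     <- max(births_toy$weight[is_low])
lightest_not_low <- min(births_toy$weight[!is_low])

# If the groups do not overlap, a single threshold separates them
groups_separate <- heaviest_low < lightest_not_low
c(heaviest_low = heaviest_low, lightest_not_low = lightest_not_low)
groups_separate
```

If the check returned FALSE, weight alone would not explain the classification and another variable would have to be involved.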
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Question
Is there a difference in length of pregnancy between younger mothers and mature mothers?
Write hypotheses
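\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean length of pregnancy for younger mothers and \(\mu_2\) is the mean for mature mothers.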
#Mature and younger mom subsets
mature.mom <- subset(ncbirths, ncbirths$mature == "mature mom")
younger.mom <- subset(ncbirths, ncbirths$mature == "younger mom")
#Mean and standard deviation of sample
mean.m.length <- mean(mature.mom$weeks, na.rm = TRUE)
mean.y.length <- mean(younger.mom$weeks, na.rm = TRUE)
sd.mature.length <- sd(mature.mom$weeks, na.rm = TRUE)
sd.younger.length <- sd(younger.mom$weeks, na.rm = TRUE)
#Difference in mean length (younger mom minus mature mom)
mean.diff.length <- (mean.y.length - mean.m.length)
#Sample size
table(is.na(mature.mom$weeks))
##
## FALSE TRUE
## 132 1
table(is.na(younger.mom$weeks))
##
## FALSE TRUE
## 866 1
#Standard error
SE.age.length <- sqrt((sd.mature.length^2/132) + (sd.younger.length^2/866))
#Test Statistic
(mean.diff.length - 0)/SE.age.length
## [1] 1.211299
(0.35948981 - 0)/0.29678
## [1] 1.211301
#Probability of getting the test statistic
pt(1.211299, df = 131, lower.tail = FALSE)*2
## [1] 0.2279614
Decision: The p-value is 0.228, which is above 0.05, so we fail to reject H0.
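The same decision can be reached with a confidence interval: if a 95% interval for the difference in means contains 0, the two-sided test at the 0.05 level fails to reject H0. The sketch below uses the difference and standard error printed in the test-statistic calculation above. The variable names are chosen here for illustration.

```r
# Values taken from the test-statistic calculation above
diff_hat <- 0.35948981  # younger minus mature, in weeks
se       <- 0.29678
df       <- 131         # conservative: smaller group size minus 1

# 95% confidence interval for the difference in means
t_star <- qt(0.975, df = df)
ci     <- diff_hat + c(-1, 1) * t_star * se

# Matching two-sided test
t_stat <- (diff_hat - 0) / se
p_two  <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

ci
contains_zero <- ci[1] < 0 && 0 < ci[2]
contains_zero  # TRUE agrees with failing to reject H0
```

The interval and the p-value use the same standard error and degrees of freedom, so they always lead to the same decision at the matching significance level.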