In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless stated otherwise, assume a significance level of 0.05.
Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.
# Load Openintro Library
library(openintro)
# Store ncbirths in environment
ncbirths <- ncbirths
str(ncbirths)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.
# Store mean, standard deviation, and sample size of dataset
mean.gained <- mean(ncbirths$gained, na.rm = TRUE)
sd.gained <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))
##
## FALSE TRUE
## 973 27
sample.size <- 973
# Calculate t-critical value for 90% confidence
abs(qt(.05, df=972))
## [1] 1.646423
# Calculate margin of error
1.646423 * sd.gained/sqrt(sample.size)
## [1] 0.7516827
# Boundaries of interval
mean.gained - 0.7516827
## [1] 29.57411
mean.gained + 0.7516827
## [1] 31.07748
We are 90% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.57 and 31.08 lbs.
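As a cross-check on the hand calculation above, here is a minimal sketch that wraps the same steps in a small function and compares the result with R's built-in t.test(). The helper name t_ci is hypothetical. The data are only the first ten gained values shown in the str() output, so the numbers will not match the full-sample interval.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_ci <- function(x, conf.level = 0.90) {
  x <- x[!is.na(x)]                                    # drop missing values
  n <- length(x)                                       # sample size after removing NAs
  se <- sd(x) / sqrt(n)                                # standard error of the mean
  t.crit <- qt(1 - (1 - conf.level) / 2, df = n - 1)   # two-sided critical value
  mean(x) + c(-1, 1) * t.crit * se                     # lower and upper bounds
}

# First ten values of gained from the str() output (illustration only)
gained.head <- c(38, 20, 38, 34, 27, 22, 76, 15, NA, 52)

ci.manual  <- t_ci(gained.head, conf.level = 0.90)
ci.builtin <- t.test(gained.head, conf.level = 0.90)$conf.int

print(ci.manual)
print(as.numeric(ci.builtin))
```

On the full dataset, the equivalent call would be t.test(ncbirths$gained, conf.level = 0.90)$conf.int, which drops the missing values itself.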
# Calculate t-critical value for 95% confidence
abs(qt(.025, df=sample.size-1))
## [1] 1.962408
# Boundaries of the 95% interval
mean.gained - 1.962341*sd.gained/sqrt(sample.size)
## [1] 29.42988
mean.gained + 1.962341*sd.gained/sqrt(sample.size)
## [1] 31.22171
We are 95% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.43 and 31.22 lbs.
### Question 3 - Single Sample t-test
The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.
a. Write hypotheses
H0: μ = 7.7
HA: μ ≠ 7.7
b. Test by p-value and decision
# Sample statistics (sample mean, standard deviation, and size)
mean.weight <- mean(ncbirths$weight, na.rm = TRUE)
sd.weight <- sd(ncbirths$weight, na.rm=TRUE)
table(is.na(ncbirths$weight))
##
## FALSE
## 1000
# Sample size: 1000
# Test statistic
(mean.weight - 7.7)/(sd.weight/sqrt(1000))
## [1] -12.55388
t-score = -12.55388
# Probability of test statistic by chance
pt(-12.55388, df=999)*2
## [1] 1.135354e-33
Since the p-value (about 1.1e-33) is far below 0.05, we reject the null hypothesis. The data provide strong evidence that the average birthweight of NC babies differs from the European average of 7.7 lbs.
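The test statistic and p-value above can also be obtained from t.test() in one call. The sketch below computes both ways on the first ten weight values shown in the str() output, so the numbers are for illustration only and differ from the full-sample result. Writing the p-value as 2 * pt(-abs(t), df) gives the two-sided tail area whether the statistic is negative or positive.

```r
# First ten birth weights from the str() output (illustration only)
weight.head <- c(7.63, 7.88, 6.63, 8, 6.38, 5.38, 8.44, 4.69, 8.81, 6.94)
mu0 <- 7.7   # hypothesized mean (European average birthweight)

# Manual test statistic and two-sided p-value
n <- length(weight.head)
t.stat <- (mean(weight.head) - mu0) / (sd(weight.head) / sqrt(n))
p.manual <- 2 * pt(-abs(t.stat), df = n - 1)

# Built-in equivalent
res <- t.test(weight.head, mu = mu0, alternative = "two.sided")

print(c(t.manual = t.stat, t.builtin = unname(res$statistic)))
print(c(p.manual = p.manual, p.builtin = res$p.value))
```

On the full dataset, the corresponding call would be t.test(ncbirths$weight, mu = 7.7).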
In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.
a. Write hypotheses
H0: μd = 0
HA: μd ≠ 0
b. Test by confidence interval or p-value and decision
ncbirths$dif <- ncbirths$fage - ncbirths$mage
table(is.na(ncbirths$dif))
##
## FALSE TRUE
## 829 171
#The sample size is 829.
(mean(ncbirths$dif,na.rm=TRUE)-0)/(sd(ncbirths$dif,na.rm=TRUE)/sqrt(829))
## [1] 17.6727
# T-score is 17.6727
pt(17.6727, df=828, lower.tail=FALSE)*2
## [1] 1.504649e-59
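The paired calculation above can be sketched without typing the sample size in by hand, by counting the complete pairs directly. The example uses only the first ten fage and mage values shown in the str() output, so it illustrates the mechanics rather than reproducing the full-sample result.

```r
# First ten father and mother ages from the str() output (illustration only)
fage.head <- c(NA, NA, 19, 21, NA, NA, 18, 17, NA, 20)
mage.head <- c(13, 14, 15, 15, 15, 15, 15, 15, 16, 16)

# Paired differences; NA wherever the father's age is missing
d <- fage.head - mage.head
n.pairs <- sum(!is.na(d))   # number of complete pairs, computed from the data

# Manual paired t statistic and two-sided p-value
t.stat <- mean(d, na.rm = TRUE) / (sd(d, na.rm = TRUE) / sqrt(n.pairs))
p.manual <- 2 * pt(abs(t.stat), df = n.pairs - 1, lower.tail = FALSE)

# Built-in equivalent; incomplete pairs are dropped automatically
res <- t.test(fage.head, mage.head, paired = TRUE)

print(c(n = n.pairs, t.manual = t.stat, t.builtin = unname(res$statistic)))
print(c(p.manual = p.manual, p.builtin = res$p.value))
```

Counting pairs with sum(!is.na(d)) keeps the degrees of freedom in step with the data if the missing values ever change.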
Since the p-value is far below 0.05, we reject the null hypothesis. The data provide strong evidence of a difference between the average ages of mothers and fathers; the positive test statistic indicates that fathers are older on average.
### Question 5 - Two Independent Sample t-test
In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.
H0: μ1 − μ2 = 0
HA: μ1 − μ2 ≠ 0
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
mean.smoker <- mean(smokers$weeks, na.rm = TRUE)
mean.nonsmoker <- mean(nonsmokers$weeks, na.rm = TRUE)
sd.smokers <- sd(smokers$weeks, na.rm =TRUE)
sd.nonsmokers <- sd(nonsmokers$weeks, na.rm = TRUE)
summary(ncbirths$habit)
## nonsmoker smoker NA's
## 873 126 1
SE <- sqrt((sd.smokers^2/126)+(sd.nonsmokers^2/873))
(((mean.smoker)-(mean.nonsmoker))-0)/SE
## [1] 0.5190483
pt(0.5190483, df=125, lower.tail=FALSE)*2
## [1] 0.6046448
Since the p-value (0.605) is greater than 0.05, the data are not strong enough to reject the null hypothesis.
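The calculation above uses df = 125, which matches the conservative choice min(126, 873) − 1. R's t.test() instead uses the Welch–Satterthwaite approximation for the degrees of freedom. The sketch below computes both versions side by side. The helper name welch_t is hypothetical and the two vectors are made-up toy data, not values from ncbirths.

```r
# Hypothetical helper: unequal-variance two-sample t statistic with two df choices
welch_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  n1 <- length(x)
  n2 <- length(y)
  v1 <- var(x) / n1                       # squared standard error, group 1
  v2 <- var(y) / n2                       # squared standard error, group 2
  t.stat <- (mean(x) - mean(y)) / sqrt(v1 + v2)
  df.conservative <- min(n1, n2) - 1      # choice used in the hand calculation
  df.welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p.two.sided <- function(df) 2 * pt(-abs(t.stat), df)
  c(t = t.stat,
    df.conservative = df.conservative,
    p.conservative = p.two.sided(df.conservative),
    df.welch = df.welch,
    p.welch = p.two.sided(df.welch))
}

# Made-up toy data for illustration (not taken from ncbirths)
group.a <- c(38, 40, 37, 39, 41, 36, NA)
group.b <- c(39, 38, 40, 41, 39, 40, 38, 37, 42)

out <- welch_t(group.a, group.b)
res <- t.test(group.a, group.b)           # var.equal = FALSE by default (Welch)

print(round(out, 4))
print(c(t.builtin = unname(res$statistic), df.builtin = unname(res$parameter)))
```

The conservative df is never larger than the Welch df, so its p-value is never smaller; the decision at the 0.05 level is usually the same, as it would be for a p-value as large as the one above.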
Conclusion
The data do not provide convincing evidence of a difference in average pregnancy length between smokers and nonsmokers.
### Question 6
Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.
Write hypotheses
Test by confidence interval or p-value and decision
y.mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
m.mothers <- subset(ncbirths, ncbirths$mature == "mature mom")
mean.y.mothers <- mean(y.mothers$gained, na.rm=TRUE)
mean.m.mothers <- mean(m.mothers$gained, na.rm=TRUE)
sd.y.mothers <- sd(y.mothers$gained, na.rm=TRUE)
sd.m.mothers <- sd(m.mothers$gained, na.rm=TRUE)
agediff.gained <- (mean.y.mothers - mean.m.mothers)
table(is.na(y.mothers$gained))
##
## FALSE TRUE
## 844 23
# Sample size: 844
table(is.na(m.mothers$gained))
##
## FALSE TRUE
## 129 4
# Sample size: 129
SE.age.diff <- sqrt((sd.y.mothers^2/844)+(sd.m.mothers^2/129))
#Finding the test statistic
(agediff.gained - 0)/SE.age.diff
## [1] 1.376483
#Finding the p-value
pt(1.376483, df=128, lower.tail=FALSE)
## [1] 0.08553763
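The group sizes above were read off the table() output and typed in by hand. A minimal sketch of computing the per-group counts, means, and standard deviations directly is shown below, followed by the same one-sided test. The data frame toy is made up for illustration; only its column names mirror ncbirths.

```r
# Made-up toy data frame standing in for ncbirths (values are not real)
toy <- data.frame(
  mature = factor(c("younger mom", "younger mom", "mature mom", "younger mom",
                    "mature mom", "younger mom", "mature mom", "younger mom")),
  gained = c(38, 20, 25, NA, 30, 34, 22, 41)
)

# Split gained by group, then summarise each group ignoring NAs
by.group <- split(toy$gained, toy$mature)
n.by  <- sapply(by.group, function(v) sum(!is.na(v)))   # non-missing counts
m.by  <- sapply(by.group, mean, na.rm = TRUE)
sd.by <- sapply(by.group, sd, na.rm = TRUE)

# Standard error of the difference and test statistic (younger minus mature)
se <- sqrt(sum(sd.by^2 / n.by))
t.stat <- (m.by[["younger mom"]] - m.by[["mature mom"]]) / se

# One-sided p-value for HA: younger mothers gain more, conservative df
df <- min(n.by) - 1
p.one.sided <- pt(t.stat, df = df, lower.tail = FALSE)

print(n.by)
print(c(t = t.stat, df = df, p = p.one.sided))
```

Because the alternative is one-sided, the upper-tail probability is not doubled, matching the calculation above.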
Conclusion
We fail to reject the null hypothesis because the p-value (0.086) is greater than 0.05. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.
### Question 8
Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.
Test by confidence interval or p-value and decision
Mature.mom <- subset(ncbirths, ncbirths$mature == "mature mom")
younger.mom <- subset(ncbirths, ncbirths$mature == "younger mom")
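# Mean and standard deviation of pregnancy length (weeks) for each group
mean.mature.length <- mean(Mature.mom$weeks, na.rm = TRUE)
mean.younger.length <- mean(younger.mom$weeks, na.rm = TRUE)
sd.mature.length <- sd(Mature.mom$weeks, na.rm = TRUE)
sd.younger.length <- sd(younger.mom$weeks, na.rm = TRUE)
# Difference in means: younger minus mature
mean.diff.length <- (mean.younger.length - mean.mature.length)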
table(is.na(Mature.mom$weeks))
##
## FALSE TRUE
## 132 1
# Sample size: 132
table(is.na(younger.mom$weeks))
##
## FALSE TRUE
## 866 1
# Sample size: 866
SE.age.length <- sqrt((sd.mature.length^2/132)+(sd.younger.length^2/866))
(mean.diff.length - 0)/SE.age.length
## [1] 1.211299
pt(1.211299, df=131, lower.tail = FALSE)*2
## [1] 0.2279614
The p-value is 0.228, which is greater than alpha (0.05), so we fail to reject the null hypothesis. The data do not provide convincing evidence of a difference in the length of pregnancy between younger moms and mature moms.