Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, students will assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of the gained variable
# (names chosen so they do not mask the base functions mean(), sd(), and t())
gained_mean <- mean(ncbirths$gained, na.rm = TRUE)
gained_sd <- sd(ncbirths$gained, na.rm = TRUE)
n <- sum(!is.na(ncbirths$gained))

# Calculate t-critical value for 90% confidence
t_crit <- abs(qt(p = .05, df = n - 1))

# Calculate margin of error
me <- t_crit * gained_sd/sqrt(n)

# Boundaries of confidence interval
gained_mean - me #lower bound

## [1] 29.57411

gained_mean + me #upper bound

## [1] 31.07748

We can be 90% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.6 pounds and 31.1 pounds.
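The manual steps above can be wrapped into a small reusable function and checked against R's built-in t.test(). The sketch below is only illustrative: t_ci() is a hypothetical helper name, and the toy vector is made up rather than taken from ncbirths.

```r
# Hypothetical helper: t-confidence interval for a mean, dropping NAs first
t_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  # Two-sided critical value, e.g. the 0.95 quantile for 90% confidence
  t_star <- qt(1 - (1 - conf) / 2, df = n - 1)
  me <- t_star * sd(x) / sqrt(n)
  c(lower = mean(x) - me, upper = mean(x) + me)
}

# Made-up values for illustration only (not ncbirths data)
toy <- c(25, 31, NA, 40, 18, 33, 29, 36)

manual <- t_ci(toy, conf = 0.90)
builtin <- t.test(toy, conf.level = 0.90)$conf.int

print(manual)
print(builtin)
```

Calling t_ci(ncbirths$gained, conf = 0.90) should follow the same calculation as the chunk above.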

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

#Shortcut using the t.test function 
t.test(ncbirths$gained, conf.level = .95)

## 
##  One Sample t-test
## 
## data:  ncbirths$gained
## t = 66.423, df = 972, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  29.42985 31.22174
## sample estimates:
## mean of x 
##   30.3258

How does that confidence interval compare to the one in Question #1? The 95% interval (29.4 to 31.2 pounds) is wider than the 90% interval (29.6 to 31.1 pounds). A higher confidence level requires a larger critical value, which increases the margin of error.
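The change in width comes entirely from the critical value, since the sample mean, standard deviation, and sample size are the same in both questions. A minimal sketch, assuming the 972 degrees of freedom reported in the t.test() output above:

```r
# Degrees of freedom taken from the t.test() output above
df <- 972

# Two-sided critical values for 90% and 95% confidence
t90 <- qt(0.95, df = df)
t95 <- qt(0.975, df = df)

# The ratio of the critical values equals the ratio of the interval widths,
# because the standard error is the same in both intervals
width_ratio <- t95 / t90
print(c(t90 = t90, t95 = t95, width_ratio = width_ratio))
```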

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\[H_0: \mu = 7.7 \text{ lbs}\] \[H_A: \mu \neq 7.7 \text{ lbs}\]
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mean2 <- mean(ncbirths$weight, na.rm = TRUE)
sd2 <- sd(ncbirths$weight, na.rm = TRUE)
n2 <- sum(!is.na(ncbirths$weight))

# Test statistic
ts2 <- abs((mean2 - 7.7)/(sd2/sqrt(n2)))

# Probability of test statistic by chance
pt(q = ts2, df = n2 - 1, lower.tail = FALSE)*2

## [1] 1.135415e-33

Conclusion
Because p < .0001 is below the 0.05 significance level, we reject the null hypothesis. The data suggest that the average birthweight of NC babies is different from the average birthweight of European babies.
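For reference, the calculation in the chunk above corresponds to the following test statistic and two-sided p-value, where the symbols \(\bar{x}\), \(s\), and \(n\) stand for the quantities stored in mean2, sd2, and n2:

```latex
\[
T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \mu_0 = 7.7, \qquad df = n - 1
\]
\[
p = 2 \cdot P\left(t_{n-1} \geq |T|\right) \approx 1.1 \times 10^{-33}
\]
```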

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
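\[H_0: \mu_{diff} = 0\] \[H_A: \mu_{diff} \neq 0\]
Here \(\mu_{diff}\) is the mean difference in age within each couple (father's age minus mother's age).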
Test by confidence interval or p-value and decision

#Ensure ncbirths dataset is in our environment
ncbirths  <- ncbirths
#Create a column in the ncbirths dataset that contains the difference between each father and mother
ncbirths$diff <- ncbirths$fage - ncbirths$mage
#Find the mean, standard deviation, and sample size of the difference variable
diff_mean <- mean(ncbirths$diff, na.rm = TRUE)
diff_sd <- sd(ncbirths$diff, na.rm = TRUE)
n3 <- sum(!is.na(ncbirths$diff))
#Find the 99%, two-tailed critical value
t2 <- abs(qt(p = .005, df = n3 - 1))
#Find the lower and upper bounds of the t-confidence interval
diff_mean - t2*(diff_sd/sqrt(n3))

## [1] 2.26508

diff_mean + t2*(diff_sd/sqrt(n3))

## [1] 3.040107

Conclusion
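We are 99% confident that fathers are, on average, between 2.27 and 3.04 years older than mothers. Because this interval does not contain 0, we reject the null hypothesis and conclude that there is a significant difference between the mean age of mothers and fathers.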
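A paired t-test is the same as a one-sample t-test on the differences, so t.test() with paired = TRUE offers a shortcut to the manual interval. The sketch below uses made-up ages (not ncbirths values) to show that both routes give the same interval and that incomplete pairs are dropped either way.

```r
# Hypothetical ages for illustration only (not ncbirths data)
father <- c(34, 29, NA, 41, 25, 38, 30)
mother <- c(31, 27, 24, 36, 26, NA, 28)

# Route 1: one-sample interval on the differences (NAs are removed)
by_diff <- t.test(father - mother, conf.level = 0.99)

# Route 2: paired interval, which keeps only complete pairs
by_pair <- t.test(father, mother, paired = TRUE, conf.level = 0.99)

print(by_diff$conf.int)
print(by_pair$conf.int)
```

With the real data, the equivalent call would be t.test(ncbirths$fage, ncbirths$mage, paired = TRUE, conf.level = 0.99).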

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\[H_0: \mu_1 = \mu_2\] \[H_A: \mu_1 \neq \mu_2\]
Test by confidence interval or p-value and decision
Here \(\mu_1\) is the mean length of pregnancy for smokers and \(\mu_2\) is the mean length of pregnancy for nonsmokers.

#Store smokers and non-smokers into their own subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")
#Store mean, standard deviation, and sample size of the length of pregnancy variable for the smokers and nonsmokers subset
smean <- mean(smokers$weeks, na.rm = TRUE)
ssd <- sd(smokers$weeks, na.rm = TRUE)
n4 <- sum(!is.na(smokers$weeks))
nmean <- mean(nonsmokers$weeks, na.rm = TRUE)
nsd <- sd(nonsmokers$weeks, na.rm = TRUE)
n5 <- sum(!is.na(nonsmokers$weeks))
#Store standard error
se2 <- sqrt((ssd^2/n4)+(nsd^2/n5))
#Test statistic
ts4 <- (smean - nmean)/se2
#Find the p-value, using the conservative degrees of freedom (smaller sample size minus 1)
pt(abs(ts4), df = min(n4, n5) - 1, lower.tail = FALSE)*2

## [1] 0.6046811

Conclusion
Because p = 0.605 is greater than the significance level of .05, we fail to reject the null hypothesis. There is not enough evidence to suggest that there is a significant difference in length of pregnancy between smokers and non-smokers.
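The manual p-value above appears to rely on a conservative degrees-of-freedom choice, whereas t.test() uses the Welch-Satterthwaite approximation by default. The sketch below compares the two on made-up gestation lengths (not ncbirths values); welch_df() is a hypothetical helper written for this illustration.

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom
welch_df <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  vx <- var(x) / length(x)  # squared standard error of group 1
  vy <- var(y) / length(y)  # squared standard error of group 2
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}

# Made-up pregnancy lengths in weeks, for illustration only
a <- c(38, 39, 40, 37, 41, 39)
b <- c(39, 40, 38, 41, 40, 39, 42, 38)

df_welch <- welch_df(a, b)
df_conservative <- min(length(a), length(b)) - 1
df_builtin <- unname(t.test(a, b)$parameter)

print(c(welch = df_welch, conservative = df_conservative, builtin = df_builtin))
```

The conservative value is never larger than the Welch value, so it gives a p-value that is at least as large.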

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
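\[H_0: \mu_1 = \mu_2\] \[H_A: \mu_1 > \mu_2\]
Here \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers.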
Test by confidence interval or p-value and decision

#Subset both the young and mature mothers 
young <- subset(ncbirths, ncbirths$mature == "younger mom")
mature <- subset(ncbirths, ncbirths$mature == "mature mom")
#Perform t-test for p-value
t.test(x = young$gained, y = mature$gained, alternative = "greater")

## 
##  Welch Two Sample t-test
## 
## data:  young$gained and mature$gained
## t = 1.3765, df = 175.34, p-value = 0.08521
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.3562741        Inf
## sample estimates:
## mean of x mean of y 
##  30.56043  28.79070

Conclusion
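Because p = 0.085 is greater than the 0.05 significance level, we fail to reject the null hypothesis. There is not enough evidence to suggest that the average weight gained by younger mothers is more than the average weight gained by mature mothers.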
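Written out with the values from the t.test() output above, the one-sided p-value is the upper-tail area beyond the observed test statistic:

```latex
\[
p = P\left(t_{175.34} \geq 1.3765\right) = 0.08521 > 0.05
\]
```

The one-sided 95% confidence bound of -0.356 agrees with this decision, because the interval for the difference in means still includes 0.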

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

#Store low and not low birthweight subsets 
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
#Find maximum weight for the low subset and minimum weight for the not low subset
summary(low$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500

summary(not_low$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

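I concluded that the cutoff lies between 5.5 lbs and 5.56 lbs: any birthweight at or below 5.5 lbs is classified as “low,” and any birthweight at or above 5.56 lbs is classified as “not low.” I found this by storing birthweights classified as “low” and “not low” into their own subsets and comparing the maximum of the “low” subset with the minimum of the “not low” subset. Because the two ranges do not overlap, a single cutoff separates the groups. This range is consistent with the common clinical definition of low birth weight as under 2,500 g (about 5.51 lbs).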
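The same comparison can be done without creating subsets by using tapply() to summarize weight within each class. The sketch below uses a small made-up data frame (not ncbirths) and checks it against a 2,500 g threshold; whether the dataset's authors used exactly that threshold is an assumption.

```r
# Made-up data frame for illustration only (not ncbirths data)
births <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.0, 8.4),
  lowbirthweight = c("low", "low", "not low", "not low", "low", "not low")
)

# Largest weight in each class and smallest weight in each class
max_by_class <- tapply(births$weight, births$lowbirthweight, max)
min_by_class <- tapply(births$weight, births$lowbirthweight, min)

max_low <- max_by_class[["low"]]
min_not_low <- min_by_class[["not low"]]

# Assumed threshold: 2,500 g converted to pounds (1 lb = 453.59237 g)
cutoff_lbs <- 2500 / 453.59237

print(c(max_low = max_low, cutoff_lbs = cutoff_lbs, min_not_low = min_not_low))
```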

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Conduct a hypothesis test at the .05 significance level evaluating whether the average weight gained differs between smokers and nonsmokers.
Write hypotheses
\[H_0: \mu_1 = \mu_2\] \[H_A: \mu_1 \neq \mu_2\]
Test by confidence interval or p-value and decision

#Use t.test function to find p-value
t.test(x = smokers$gained, y = nonsmokers$gained, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  smokers$gained and nonsmokers$gained
## t = 1.2229, df = 150.17, p-value = 0.2233
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.126667  4.786411
## sample estimates:
## mean of x mean of y 
##  31.92623  30.09636

Conclusion
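Because p = 0.223 is greater than the .05 significance level, we fail to reject the null hypothesis. There is not enough evidence to suggest that the average weight gained differs between smokers and nonsmokers.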
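For a two-sided test, the 95% confidence interval and the p-value at the .05 level lead to the same decision. A short check, using the numbers copied from the t.test() output above:

```r
# Values copied from the Welch two-sample t-test output above
ci <- c(-1.126667, 4.786411)
p_value <- 0.2233
alpha <- 0.05

# The interval for the difference in means contains 0 ...
contains_zero <- ci[1] < 0 && ci[2] > 0
# ... exactly when the two-sided p-value exceeds alpha
fail_to_reject <- p_value > alpha

print(c(contains_zero = contains_zero, fail_to_reject = fail_to_reject))
```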