Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is only for your own familiarity with the data.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mean(ncbirths$gained, na.rm = TRUE)

## [1] 30.3258

sd(ncbirths$gained, na.rm = TRUE)

## [1] 14.2413

sum(!is.na(ncbirths$gained))

## [1] 973

mn <- mean(ncbirths$gained, na.rm = TRUE)
sd <- sd(ncbirths$gained, na.rm = TRUE)
ss <- sum(!is.na(ncbirths$gained))

# Calculate t-critical value for 90% confidence
qt(0.05, lower.tail = FALSE, df = 972)

## [1] 1.646423

t <- qt(0.05, lower.tail = FALSE, df = 972)

# Calculate margin of error
t*(sd/sqrt(973))

## [1] 0.7516826

me <- t*(sd/sqrt(973))

# Standard error and boundaries of confidence interval
sd/sqrt(973)

## [1] 0.456555

se <- sd/sqrt(973) 

mn - t * se

## [1] 29.57411

mn + t * se

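## [1] 31.07748

We are 90% confident that the average weight gained during pregnancy by North Carolina mothers is between 29.57 and 31.08 pounds.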
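The steps above (drop missing values, compute the standard error, look up the critical value, add and subtract the margin of error) can be wrapped in a small function. This is a minimal sketch: `t_ci` is a hypothetical helper name and `toy` is made-up data, not values from ncbirths.

```r
# Hypothetical helper: t confidence interval for a mean, ignoring NAs.
t_ci <- function(x, conf = 0.90) {
  x <- x[!is.na(x)]                              # drop missing values first
  n <- length(x)                                 # sample size after dropping NAs
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Made-up toy data (assumption), with one missing value
toy <- c(28, 35, NA, 22, 40, 31, 18, 30)
ci <- t_ci(toy, conf = 0.90)
ci
```

Calling `t_ci(ncbirths$gained, conf = 0.90)` should reproduce the manual boundaries, since it follows the same formula.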

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.
How does that confidence interval compare to the one in Question #1?

# Calculate t-critical value for 95% confidence
qt(0.025, lower.tail = FALSE, df = 972)

## [1] 1.962408

t1 <- qt(0.025, lower.tail = FALSE, df = 972)

# Calculate margin of error
t1*(sd/sqrt(973))

## [1] 0.895947

me <- t1*(sd/sqrt(973))

# Boundaries of confidence interval
mn - t1 * se

## [1] 29.42985

mn + t1 * se

## [1] 31.22174

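# 90% confidence interval width
31.07-29.57

## [1] 1.5

# 95% confidence interval width
31.22-29.43

## [1] 1.79

The 95% confidence interval (29.43 to 31.22, width about 1.79) is wider than the 90% confidence interval (29.57 to 31.08, width about 1.50). A higher confidence level requires a larger t-critical value, which produces a larger margin of error around the same sample mean.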
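The effect of the confidence level on the interval width comes entirely from the critical value, since the standard error does not change. A short sketch that tabulates the critical value for several levels, using the same df = 972 as above (the 99% level is added only for illustration):

```r
# Confidence levels to compare; 0.99 is an extra level added for illustration
levels <- c(0.90, 0.95, 0.99)

# Two-sided critical values with the df used in the chunks above
t_crit <- qt(1 - (1 - levels) / 2, df = 972)
names(t_crit) <- paste0(levels * 100, "%")
t_crit
```

The critical values grow with the confidence level, so the margin of error and the interval width grow with it.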

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\(H_0: \mu = 7.7\)
\(H_A: \mu \neq 7.7\)
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
(mn_b <- mean(ncbirths$weight))

## [1] 7.101

(sd_b <- sd(ncbirths$weight))

## [1] 1.50886

table(!is.na(ncbirths$weight))

## 
## TRUE 
## 1000

# Test statistic
(ts <- (mn_b-7.7) / (sd_b/sqrt(1000)))

## [1] -12.55388

# Two-tailed p-value for the test statistic
2 * pt(-abs(ts), df = 999)

## [1] 1.135415e-33

Conclusion
p < \(\alpha\), therefore we reject \(H_0\). The data provide strong evidence that the average birthweight of NC babies is different from that of European babies.
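As a cross-check on the manual calculation, base R's `t.test()` performs the same one-sample test. The sketch below runs both on made-up data (`toy_weights` is an assumption, not the ncbirths values) and compares the results.

```r
# Made-up birth weights in lbs (assumption, for illustration only)
toy_weights <- c(6.9, 7.4, 8.1, 5.8, 7.0, 7.7, 6.3, 7.2)
mu0 <- 7.7                                   # hypothesized mean under H0

# Manual test statistic and two-tailed p-value
n <- length(toy_weights)
ts_manual <- (mean(toy_weights) - mu0) / (sd(toy_weights) / sqrt(n))
p_manual <- 2 * pt(-abs(ts_manual), df = n - 1)

# Same test with the built-in function
res <- t.test(toy_weights, mu = mu0, alternative = "two.sided")
c(manual = p_manual, t.test = res$p.value)
```

On the real data, `t.test(ncbirths$weight, mu = 7.7)` would be the equivalent call.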

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
\(\mu_{diff}\) = mean difference in age within each pair (father minus mother)
\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \neq 0\)
Test by confidence interval or p-value and decision

# Parent Dataframe
parents <- data.frame(ncbirths$fage, ncbirths$mage)

# Create a diff column holding each pair's age difference (father minus mother)
parents$diff <- parents$ncbirths.fage - parents$ncbirths.mage

# Calculate the test statistic
ts1 <- (mean(parents$diff, na.rm = TRUE)-0) / (sd(parents$diff, na.rm = TRUE)/sqrt(1000))

# Probability of getting that test statistic
pt(ts1, df = 999, lower.tail = FALSE)*2

## [1] 1.840249e-71

Decision
p < \(\alpha\), therefore we reject \(H_0\).
Conclusion
There is sufficient evidence to support the claim that there is a significant difference between the mean age of mothers and fathers.
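One detail worth checking in a paired test is the sample size. The chunk above divides by sqrt(1000) and uses df = 999 while also passing na.rm = TRUE, which suggests some differences are missing; in that case n should be the number of complete pairs. A minimal sketch on made-up ages (`fage_toy` and `mage_toy` are assumptions), compared against `t.test()` with `paired = TRUE`:

```r
# Made-up parent ages (assumption); two fathers' ages are missing
fage_toy <- c(31, NA, 27, 40, 35, NA, 29)
mage_toy <- c(29, 24, 28, 36, 30, 22, 27)

d <- fage_toy - mage_toy            # pairwise differences, NA where a value is missing
n_d <- sum(!is.na(d))               # number of complete pairs, used for se and df

ts_d <- mean(d, na.rm = TRUE) / (sd(d, na.rm = TRUE) / sqrt(n_d))
p_d <- 2 * pt(-abs(ts_d), df = n_d - 1)

# Built-in paired test, which also drops incomplete pairs
res <- t.test(fage_toy, mage_toy, paired = TRUE)
c(n = n_d, manual = p_d, t.test = res$p.value)
```

With a p-value as small as the one reported above, the decision is unlikely to change, but the test statistic and degrees of freedom would.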

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\(\mu_1\) = mean length of pregnancy (weeks) for smokers
\(\mu_2\) = mean length of pregnancy (weeks) for non-smokers

\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
Two-tailed test

Test by confidence interval or p-value and decision

# Create smoker and non-smoker subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

# Calculate the mean difference between smokers and nonsmokers in weeks
md <- mean(smokers$weeks, na.rm = TRUE) - mean(nonsmokers$weeks, na.rm = TRUE)

# Determine the sample sizes of pregnant mom smokers and nonsmokers
table(!is.na(smokers$weeks))

## 
## TRUE 
##  126

table(!is.na(nonsmokers$weeks))

## 
## FALSE  TRUE 
##     1   872

# Find the standard error of the difference in mean weeks
se_weeks <- sqrt((sd(smokers$weeks, na.rm = TRUE)^2/126) + (sd(nonsmokers$weeks, na.rm = TRUE)^2/872))

# Test statistic
ts2 <- (md - 0)/se_weeks

# Calculate the two-tailed p-value (conservative df = min(126, 872) - 1)
2 * pt(-abs(ts2), df = 125)

## [1] 0.6046811

Decision
p > \(\alpha\), therefore we fail to reject \(H_0\).
Conclusion
There is not sufficient evidence to support the claim that there is a significant difference in length of pregnancy between smokers and non-smokers.
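The two-sample calculation above uses the conservative degrees of freedom min(n1, n2) - 1. The sketch below packages that approach in a hypothetical helper, `two_sample_t`, and compares it with the Welch test that `t.test()` runs by default. The two vectors are made-up data, not the ncbirths subsets.

```r
# Hypothetical helper: two-sample t-test with conservative df = min(n1, n2) - 1
two_sample_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  se <- sqrt(var(x) / length(x) + var(y) / length(y))   # SE of the difference in means
  ts <- (mean(x) - mean(y)) / se
  df <- min(length(x), length(y)) - 1
  c(statistic = ts, df = df, p.value = 2 * pt(-abs(ts), df = df))
}

# Made-up pregnancy lengths in weeks (assumption)
weeks_a <- c(38, 39, 40, 37, NA, 41, 39)
weeks_b <- c(39, 40, 38, 41, 40, 39, 37, 42, 40)

manual <- two_sample_t(weeks_a, weeks_b)
welch <- t.test(weeks_a, weeks_b)    # Welch test, var.equal = FALSE by default
manual
```

Both approaches share the same test statistic. The Welch degrees of freedom are at least min(n1, n2) - 1, so the conservative p-value is never smaller than the Welch p-value.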

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
\(\mu_1\) = weight gained by mature mothers
\(\mu_2\) = weight gained by younger mothers

\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 < 0\)
One-tailed test

Test by confidence interval or p-value and decision

# Create mature and younger mom subsets
ym <- subset(ncbirths, ncbirths$mature == "younger mom")
mm <- subset(ncbirths, ncbirths$mature == "mature mom")

# Calculate the mean difference in weights 
md_weight <- mean(mm$weight) - mean(ym$weight)

# Determine the sample sizes 
table(!is.na(mm$weight))

## 
## TRUE 
##  133

table(!is.na(ym$weight))

## 
## TRUE 
##  867

# Find the standard error
se_weight <- sqrt((sd(mm$weight)^2/133)+(sd(ym$weight)^2/867))

# Test statistic
ts_weight <- (md_weight-0)/se_weight

# Calculate the p-value
pt(-ts_weight, df=132)*2

## [1] 0.8528517

Decision
p > \(\alpha\), therefore we fail to reject Ho.
Conclusion
The average weight gained by younger mothers is not significantly more than the average weight gained by mature mothers.
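The question asks about weight gained by the mother (the gained variable) and about one direction only (younger more than mature), whereas the chunk above works with the weight variable and a two-tailed p-value. A sketch of the one-sided version follows, run on made-up vectors; `gained_younger` and `gained_mature` are assumptions standing in for the non-missing values of `ym$gained` and `mm$gained`.

```r
# Made-up weight gains in lbs (assumption, for illustration only)
gained_younger <- c(30, 25, 41, 18, 33, 27, 36, 22, 29, 38)
gained_mature  <- c(28, 31, 20, 35, 26, 24)

n_y <- length(gained_younger)
n_m <- length(gained_mature)

# Standard error of the difference in means
se_gained <- sqrt(var(gained_younger) / n_y + var(gained_mature) / n_m)

# Test statistic for H_A: mean(younger) - mean(mature) > 0
ts_gained <- (mean(gained_younger) - mean(gained_mature)) / se_gained

# One-sided p-value: upper tail only, conservative df = min(n) - 1
df_gained <- min(n_y, n_m) - 1
p_one_sided <- pt(ts_gained, df = df_gained, lower.tail = FALSE)
p_one_sided
```

On the real data, missing values in gained would need to be removed first and the sample sizes counted after removal.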

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# Create "low" and "not low" subsets
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

# Determine the max and min values using the fivenum function
fivenum(low$weight)

## [1] 1.000 3.095 4.560 5.160 5.500

fivenum(not_low$weight)

## [1]  5.56  6.75  7.44  8.13 11.75

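These data suggest that a “low” birth weight is any value at or below 5.50 lbs, and a “not low” birth weight is any value at or above 5.56 lbs. The largest weight in the “low” group and the smallest weight in the “not low” group do not overlap, so the cutoff lies between 5.50 and 5.56 lbs.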
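The same check can be done directly by taking the maximum of one group and the minimum of the other and confirming that they do not overlap. A minimal sketch on a made-up data frame (`toy` is an assumption that mimics the two relevant columns):

```r
# Made-up data frame (assumption) with the same column names as ncbirths
toy <- data.frame(
  weight = c(4.2, 5.5, 5.56, 7.1, 3.0, 8.3),
  lowbirthweight = c("low", "low", "not low", "not low", "low", "not low")
)

# Largest "low" weight and smallest "not low" weight bracket the cutoff
max_low <- max(toy$weight[toy$lowbirthweight == "low"])
min_not_low <- min(toy$weight[toy$lowbirthweight == "not low"])

c(max_low = max_low, min_not_low = min_not_low, separated = max_low < min_not_low)
```

If `separated` is 1 (TRUE), the two groups are cleanly split and the cutoff falls between the two reported values.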

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
In the ncbirths dataset, is there a significant difference in length of pregnancy between married mothers and mothers who are not married?
Write hypotheses
\(\mu_1\) = mean length of pregnancy (weeks) for married mothers
\(\mu_2\) = mean length of pregnancy (weeks) for not married mothers