Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless otherwise specified, assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Explore it by using the str() function, viewing the data frame, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
meang <- mean(ncbirths$gained, na.rm = TRUE)
sdg <- sd(ncbirths$gained, na.rm = TRUE)
(ssg <- sum(!is.na(ncbirths$gained)))  # count of non-missing values

## [1] 973

The mean weight gained by mothers during pregnancy is approximately 30.33 pounds.
The standard deviation of the weight gained by mothers during pregnancy is approximately 14.24 pounds.
The sample size for the weight gained by mothers during pregnancy is 973.

# Calculate t-critical value for 90% confidence
abs(qt(0.05, df = 972))

## [1] 1.646423

The t-critical value is 1.646423.

# Calculate margin of error
1.646423 * (sdg/sqrt(973))

## [1] 0.7516827

The margin of error is 0.7516827.

# Boundaries of confidence interval
meang - 0.7516827

## [1] 29.57411

meang + 0.7516827

## [1] 31.07748

We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
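The steps above (critical value, margin of error, bounds) can be wrapped in a small function and cross-checked against base R's t.test(). This is only a sketch: the helper name t_ci and the vector gained_example are hypothetical, and the values are made up rather than taken from ncbirths.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring missing values
t_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                                   # drop missing values
  n <- length(x)
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # two-sided critical value
  margin <- t_crit * sd(x) / sqrt(n)                  # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up values standing in for ncbirths$gained
gained_example <- c(25, 30, NA, 42, 18, 35, 28, 31, NA, 22)

ci_manual  <- t_ci(gained_example, conf_level = 0.90)
ci_builtin <- t.test(gained_example, conf.level = 0.90)$conf.int

ci_manual
ci_builtin
```

Both approaches should give the same interval, since t.test() also drops missing values and uses n - 1 degrees of freedom.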

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Calculate t-critical value for 95% confidence
abs(qt(0.025, df = 972))

## [1] 1.962408

The t-critical value is 1.962408.

# Margin of error
1.962408 * (sdg/sqrt(973))

## [1] 0.8959472

The margin of error is 0.8959472.

# Lower bound
meang - 0.8959472

## [1] 29.42985

# Upper bound
meang + 0.8959472

## [1] 31.22174

We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?
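The 95% confidence interval is wider than the 90% confidence interval. A higher confidence level requires a larger t-critical value (1.962 versus 1.646), which increases the margin of error while the sample mean and standard error stay the same.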
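As a quick illustration of why the interval widens, the critical value can be computed for several confidence levels at the same degrees of freedom (972, as in Questions 1 and 2). The 99% level is added here only for comparison and is not part of the assignment.

```r
# Critical values for increasing confidence levels, df = 972
conf_levels <- c(0.90, 0.95, 0.99)
t_crit <- qt(1 - (1 - conf_levels) / 2, df = 972)

data.frame(conf_level = conf_levels, t_critical = t_crit)
```

The critical value grows with the confidence level, so the margin of error grows with it.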

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses

\(H_0: \mu = 7.7\)
\(H_A: \mu \neq 7.7\)

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
meanw <- mean(ncbirths$weight)
sdw <- sd(ncbirths$weight)
(ssw <- table(!is.na(ncbirths$weight)))

## 
## TRUE 
## 1000

The mean birthweight of North Carolina babies is 7.101 lbs.
The standard deviation of the birthweight of North Carolina babies is 1.51 lbs.
The sample size for the birthweight of North Carolina babies is 1000.

# Test statistic
(meanw - 7.7) / (sdw / sqrt(1000))

## [1] -12.55388

The t-score is equal to -12.55388.

# Probability of test statistic by chance
pt(-12.55388, df = 999)*2

## [1] 1.135354e-33

The p-value is 1.135354e-33.
Decision: The p-value is far below the significance level of 0.05, so we reject the null hypothesis (\(H_0\)).

Conclusion
Statement: The data suggest that the average birthweight of North Carolina babies differs from the 7.7 lb average birthweight of European babies.
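The same one-sample test can be sketched with base R's t.test(), which returns the test statistic and two-sided p-value in one call. The vector weight_example below is made-up data used only to show that the manual calculation and t.test() agree; it is not drawn from ncbirths.

```r
# Made-up birthweights (lbs) standing in for ncbirths$weight
weight_example <- c(7.6, 6.9, 8.1, 5.4, 7.2, 6.8, 7.9, 6.1)
n <- length(weight_example)

# Manual test statistic and two-sided p-value against mu = 7.7
t_stat   <- (mean(weight_example) - 7.7) / (sd(weight_example) / sqrt(n))
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# Built-in equivalent
res <- t.test(weight_example, mu = 7.7)

c(manual_t = t_stat, builtin_t = unname(res$statistic))
c(manual_p = p_manual, builtin_p = res$p.value)
```

Using -abs(t_stat) makes the doubling correct regardless of the sign of the test statistic.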

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses

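\(H_0: \mu_{diff} = 0\)
\(H_A: \mu_{diff} \neq 0\)
where \(\mu_{diff}\) is the mean of the differences (father's age minus mother's age).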

Test by confidence interval or p-value and decision

# Creating a diff variable
ncbirths$diff <- ncbirths$fage - ncbirths$mage

# finding the mean, standard deviation and sample size
meand <- mean(ncbirths$diff, na.rm = TRUE)
sdd <- sd(ncbirths$diff, na.rm = TRUE)
(ssd <- sum(!is.na(ncbirths$diff)))  # count of complete father/mother age pairs

## [1] 829

The mean of the age difference is 2.65 years.
The standard deviation of the age difference is 4.32 years.
The sample size is 829.

# Test statistic
(meand - 0) / (sdd/sqrt(829))

## [1] 17.6727

The t-score is 17.6727.

# Probability of the test statistic by chance
pt(17.6727, df = 828, lower.tail = FALSE)*2

## [1] 1.504649e-59

The p-value is equal to 1.504649e-59.
Decision: The p-value is far below the significance level of 0.05, so we reject the null hypothesis (\(H_0\)).

Conclusion
Statement: The data suggest that there is a difference between the mean age of the fathers and the mean age of the mothers.
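A paired test is a one-sample test on the differences, which can be sketched as follows. The vectors fage_example and mage_example are made-up ages (with one missing value) used only to show that the manual steps match t.test() with paired = TRUE.

```r
# Made-up paired ages; one father's age is missing
fage_example <- c(31, 28, NA, 40, 25, 33, 29)
mage_example <- c(29, 27, 30, 35, 26, 30, 24)

# Manual approach: one-sample t-test on the complete differences
d <- fage_example - mage_example
d <- d[!is.na(d)]
t_stat   <- mean(d) / (sd(d) / sqrt(length(d)))
p_manual <- 2 * pt(abs(t_stat), df = length(d) - 1, lower.tail = FALSE)

# Built-in equivalent; incomplete pairs are dropped automatically
res <- t.test(fage_example, mage_example, paired = TRUE)

c(manual_t = t_stat, builtin_t = unname(res$statistic))
c(manual_p = p_manual, builtin_p = res$p.value)
```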

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses

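\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
where \(\mu_1\) is the mean length of pregnancy for smokers and \(\mu_2\) is the mean length of pregnancy for nonsmokers.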

Test by confidence interval or p-value and decision

# smokers subset
smokers <- subset(ncbirths, ncbirths$habit == "smoker")

# non smoker subset
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

# Finding the mean, standard deviation and sample size of each
means <- mean(smokers$weeks, na.rm = TRUE)
meannons <- mean(nonsmokers$weeks, na.rm = TRUE)

sds <- sd(smokers$weeks, na.rm = TRUE)
sdnons <- sd(nonsmokers$weeks, na.rm = TRUE)

(sss <- table(!is.na(smokers$weeks)))

## 
## TRUE 
##  126

(ssnons <- sum(!is.na(nonsmokers$weeks)))  # count of non-missing values

## [1] 872

The mean pregnancy length for smokers is 38.44 weeks.
The standard deviation of the pregnancy length for smokers is 2.47 weeks.
The sample size for the pregnancy length for smokers is 126.
The mean pregnancy length for nonsmokers is 38.32 weeks.
The standard deviation of the pregnancy length for nonsmokers is 2.99 weeks.
The sample size for the pregnancy length for nonsmokers is 872.

# standard error 
sqrt((sds^2/126) + (sdnons^2/873))

## [1] 0.2420528

The standard error is 0.2420528.

# test statistic
((means - meannons)-0) / 0.2420528

## [1] 0.5190483

The t-score is 0.5190483.

# p-value
pt(0.5190483, df = 125, lower.tail = FALSE)*2

## [1] 0.6046448

The p-value is 0.6046448.
Decision: The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis.

Conclusion
Statement: There is not sufficient evidence to suggest a difference in the mean length of pregnancy between smokers and nonsmokers.
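The calculation above uses the conservative degrees of freedom, min(n1 - 1, n2 - 1). Base R's t.test() instead uses the Welch approximation for the degrees of freedom, so its p-value can differ slightly even though the test statistic is identical. The sketch below compares the two on made-up data; smoker_weeks and nonsmoker_weeks are hypothetical vectors, not values from ncbirths.

```r
# Made-up pregnancy lengths (weeks) for two independent groups
smoker_weeks    <- c(38, 39, 36, 40, 37, 41)
nonsmoker_weeks <- c(39, 40, 38, 37, 41, 39, 40, 36, 38)
n1 <- length(smoker_weeks)
n2 <- length(nonsmoker_weeks)

# Manual approach with the conservative degrees of freedom
se     <- sqrt(var(smoker_weeks) / n1 + var(nonsmoker_weeks) / n2)
t_stat <- (mean(smoker_weeks) - mean(nonsmoker_weeks)) / se
df_conservative <- min(n1, n2) - 1
p_conservative  <- 2 * pt(abs(t_stat), df = df_conservative, lower.tail = FALSE)

# Welch two-sample t-test (the default, var.equal = FALSE)
welch <- t.test(smoker_weeks, nonsmoker_weeks)

c(t_manual = t_stat, t_welch = unname(welch$statistic))
c(df_conservative = df_conservative, df_welch = unname(welch$parameter))
c(p_conservative = p_conservative, p_welch = welch$p.value)
```

The Welch degrees of freedom are never smaller than the conservative value, so the conservative p-value is never smaller than the Welch p-value.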

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

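Write hypotheses

\(H_0: \mu_1 = \mu_2\)
\(H_A: \mu_1 > \mu_2\)
where \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers.

Test by confidence interval or p-value and decision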

# younger mothers subset
youngermoms <- subset(ncbirths, ncbirths$mature == "younger mom")

# mature mothers subset
maturemoms <- subset(ncbirths, ncbirths$mature == "mature mom")

# Finding the mean, standard deviation and sample size
meany <- mean(youngermoms$gained, na.rm = TRUE)
meanm <- mean(maturemoms$gained, na.rm = TRUE)

sdy <- sd(youngermoms$gained, na.rm = TRUE)
sdm <- sd(maturemoms$gained, na.rm = TRUE)

(ssy <- table(!is.na(youngermoms$gained)))

## 
## FALSE  TRUE 
##    23   844

(ssm <- table(!is.na(maturemoms$gained)))

## 
## FALSE  TRUE 
##     4   129

The mean weight gained by younger mothers is 30.56 pounds.
The standard deviation of the weight gained by younger mothers is 14.35 pounds.
The mean weight gained by mature mothers is 28.79 pounds.
The standard deviation of the weight gained by mature mothers is 13.48 pounds.
The sample size for younger mothers is 844.
The sample size for mature mothers is 129.

# Standard error
sqrt((sdy^2/844) + (sdm^2/129))

## [1] 1.285689

The standard error is 1.285689.

# test statistic
((meany - meanm) - 0) / 1.285689

## [1] 1.376483

The t-score is 1.376483.

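# two-sided tail probability
pt(1.376483, df = 128, lower.tail = FALSE)*2

## [1] 0.1710753

The two-sided tail probability is 0.1710753. Because the alternative hypothesis is one-sided (\(\mu_1 > \mu_2\)) and the t-score is positive, the p-value is half of that, approximately 0.0855.
Decision: The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis.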

Conclusion
Statement: There is not sufficient evidence to suggest that the average weight gained by younger moms is greater than the average weight gained by mature moms.
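For a one-sided alternative such as \(H_A: \mu_1 > \mu_2\), the p-value is the area in the upper tail only, with no doubling. A minimal sketch using the t-score (1.376483) and degrees of freedom (128) reported above:

```r
# t-score and conservative degrees of freedom from this question
t_stat <- 1.376483
df     <- 128

# One-sided p-value: area above the observed t-score
p_upper <- pt(t_stat, df = df, lower.tail = FALSE)

# Two-sided p-value, shown only for comparison
p_two_sided <- 2 * p_upper

c(one_sided = p_upper, two_sided = p_two_sided)
```

Either value is above 0.05, so the decision to fail to reject the null hypothesis does not change.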

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# creating subsets
low <- subset(ncbirths, ncbirths$lowbirthweight == "low")
not_low <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

# Analyzing birthweight
mean(low$weight)

## [1] 4.034775

mean(not_low$weight)

## [1] 7.483847

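Using the mean function, the average birthweight in the “low” group (about 4.03 lbs) is lower than in the “not low” group (about 7.48 lbs). The means alone do not give the cutoff, though: the cutoff must lie between the heaviest baby classified as “low” and the lightest baby classified as “not low”.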
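One way to locate the cutoff is to compare the maximum weight in the “low” group with the minimum weight in the “not low” group. The sketch below uses a small made-up data frame, births_example, with the same column names as ncbirths; the weights are hypothetical and do not reveal the real cutoff.

```r
# Made-up data with the same column names as ncbirths
births_example <- data.frame(
  weight = c(3.2, 4.8, 5.4, 5.6, 6.9, 7.5, 8.1),
  lowbirthweight = c("low", "low", "low",
                     "not low", "not low", "not low", "not low")
)

# Heaviest "low" baby and lightest "not low" baby
max_low     <- max(births_example$weight[births_example$lowbirthweight == "low"])
min_not_low <- min(births_example$weight[births_example$lowbirthweight == "not low"])

# Range of weights within each group, for a quick visual check
tapply(births_example$weight, births_example$lowbirthweight, range)

c(max_low = max_low, min_not_low = min_not_low)
```

If the two groups do not overlap, the cutoff falls somewhere between these two values. Applying the same max() and min() calls to the low and not_low subsets created above would bracket the cutoff in the real data.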

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

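Question

Is there a difference in the average number of hospital visits during pregnancy between younger mothers and mature mothers?

Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\)
\(H_A: \mu_1 - \mu_2 \neq 0\)
where \(\mu_1\) is the mean number of visits for mature mothers and \(\mu_2\) is the mean number of visits for younger mothers.

Test by confidence interval or p-value and decision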

# Younger mothers subset 
youngermothers <- subset(ncbirths, ncbirths$mature == "younger mom")

# Mature mothers subset
maturemothers <- subset(ncbirths, ncbirths$mature == "mature mom")

# Finding the mean, standard deviation and sample size
(meanym <- mean(youngermothers$visits, na.rm = TRUE))

## [1] 12.02791

(meanmm <- mean(maturemothers$visits, na.rm = TRUE))

## [1] 12.61069

(sdym <- sd(youngermothers$visits, na.rm = TRUE))

## [1] 3.883239

(sdmm <- sd(maturemothers$visits, na.rm = TRUE))

## [1] 4.379274

table(!is.na(youngermothers$visits))

## 
## FALSE  TRUE 
##     7   860

table(!is.na(maturemothers$visits))

## 
## FALSE  TRUE 
##     2   131

# Standard Error
sqrt((sdmm^2/131)+(sdym^2/860))

## [1] 0.4048847

# test statistic
((meanmm - meanym)-0) / 0.4048847

## [1] 1.439373

# Probability of the test statistic by chance (two-sided p-value)
pt(1.439373, df = 130, lower.tail = FALSE)*2

## [1] 0.1524484

The p-value is equal to 0.1524484.
Decision: The p-value is greater than the significance level of 0.05, so we fail to reject \(H_0\).

Conclusion
Statement: There is not sufficient evidence to suggest a difference in the average number of visits between younger moms and mature moms.
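The same question can also be approached with a confidence interval for the difference in means. The sketch below reuses the sample means, standard error, and conservative degrees of freedom (130) reported above to build a 95% interval; the variable names are chosen here for illustration.

```r
# Summary values reported above
diff_means <- 12.61069 - 12.02791  # mature minus younger
se         <- 0.4048847
df         <- 130

# 95% confidence interval for the difference in mean visits
t_crit <- qt(0.975, df = df)
ci <- diff_means + c(-1, 1) * t_crit * se

ci
```

The interval contains 0, which agrees with the decision to fail to reject \(H_0\) at the 0.05 significance level.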