Project #5 - Inference on Numerical Data

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean and standard deviation; count non-missing values to get the sample size
mn <- mean(ncbirths$gained, na.rm = TRUE)

stdev <- sd(ncbirths$gained, na.rm = TRUE)

table(is.na(ncbirths$gained))

## 
## FALSE  TRUE 
##   973    27

The mean is 30.33, the standard deviation is 14.24, and the sample size is 973.

# Calculate t-critical value for 90% confidence
t <- abs(qt(p = 0.05, df = 972))

The t-critical value is 1.65.

# Calculate standard error, then margin of error
se <- stdev/sqrt(973)

me <- t*se

The margin of error is 0.75.

# Boundaries of confidence interval
mn - t * se

## [1] 29.57411

mn + t * se

## [1] 31.07748

We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 pounds and 31.08 pounds.
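The steps above can be wrapped into one function and cross-checked against R's built-in t.test(). This is a minimal sketch: the helper name t_conf_int and the sample values are made up for illustration and are not part of the original project.

```r
# Hypothetical helper: t-confidence interval for a mean, ignoring NAs
t_conf_int <- function(x, level = 0.90) {
  x <- x[!is.na(x)]                               # drop missing values
  n <- length(x)                                  # sample size actually used
  se <- sd(x) / sqrt(n)                           # standard error of the mean
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  c(lower = mean(x) - t_crit * se, upper = mean(x) + t_crit * se)
}

# Made-up values for illustration only; with the project data the call
# would be t_conf_int(ncbirths$gained, level = 0.90)
x <- c(30, 25, NA, 41, 18, 33, 27, 36)
ci <- t_conf_int(x, level = 0.90)
ci

# Cross-check: t.test() also drops NAs and reports the same interval
builtin <- t.test(x, conf.level = 0.90)$conf.int
builtin
```

Counting the non-missing values inside the function avoids typing the sample size by hand.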

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# The mean, standard deviation, sample size and standard error remain the same. The t-critical value and the boundaries change.

# T-critical value
t5 <- abs(qt(p = 0.025, df = 972))

# Boundaries
mn - t5 * se

## [1] 29.42985

mn + t5 * se

## [1] 31.22174

We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 pounds and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?

# 90% confidence interval difference
31.07-29.57

## [1] 1.5

# 95% confidence interval difference
31.22-29.43

## [1] 1.79

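The 95% interval is wider (1.79 pounds versus 1.50 pounds) because a higher confidence level requires a larger t-critical value: to be more confident that the interval captures the true mean, we must allow a wider range of plausible values.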
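To see how the width grows with the confidence level, the critical values can be tabulated side by side. This sketch assumes the rounded standard deviation of 14.24 and the sample size of 973 reported in Question 1, so the widths are approximate; the 99% row is added only for comparison.

```r
# Confidence levels to compare (99% is an extra level, not in the project)
conf_levels <- c(0.90, 0.95, 0.99)

# Approximate standard error from the rounded values in Question 1
se_approx <- 14.24 / sqrt(973)

# Two-sided t-critical values with df = 973 - 1
t_crit <- qt(1 - (1 - conf_levels) / 2, df = 972)

# Interval width is twice the margin of error
widths <- 2 * t_crit * se_approx

data.frame(level = conf_levels,
           t_crit = round(t_crit, 3),
           width = round(widths, 2))
```

The standard error is the same in every row, so the change in width comes entirely from the critical value.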

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses

\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\) two-tailed

Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mnB <- mean(ncbirths$weight)
stdevB <- sd(ncbirths$weight)
table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

The mean is 7.1 pounds, standard deviation is 1.5 pounds, and sample size is 1,000 babies.

# Test statistic
(mnB-7.7) / (stdevB/sqrt(1000))

## [1] -12.55388

# Probability of test statistic by chance

pt(-12.55388, df = 999)*2

## [1] 1.135354e-33

The p-value is about 1.135e-33, which is effectively zero.

Conclusion

Since the p-value is far below any conventional significance level, we reject the null hypothesis in favor of the alternative hypothesis. In other words, the data provide strong evidence that the average birth weight of North Carolina babies differs from the European average of 7.7 lbs; the sample mean of 7.1 lbs is lower.
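The hand calculation above follows the usual one-sample t-test recipe, which can be sketched as a small function and compared with t.test(). The helper name one_sample_t and the sample weights below are hypothetical and only illustrate the mechanics.

```r
# Hypothetical helper: two-sided one-sample t-test done by hand
one_sample_t <- function(x, mu0) {
  x <- x[!is.na(x)]                               # drop missing values
  n <- length(x)
  t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # test statistic
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)     # both tails
  c(t = t_stat, df = n - 1, p = p_value)
}

# Made-up birth weights in pounds, for illustration only; with the
# project data the call would be one_sample_t(ncbirths$weight, mu0 = 7.7)
x <- c(7.6, 6.9, 8.1, 7.3, 6.4, 7.0, 7.9, 6.8)
res <- one_sample_t(x, mu0 = 7.7)
res

# Cross-check against the built-in test
ref <- t.test(x, mu = 7.7)
ref$statistic
ref$p.value
```

Using -abs(t_stat) inside pt() gives the correct two-sided p-value whether the statistic is negative or positive.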

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses

\(H_0: \mu_{diff} = 0\) \(H_A: \mu_{diff} \neq 0\) two-tailed, where \(\mu_{diff}\) is the mean of father's age minus mother's age

Test by confidence interval or p-value and decision

# column of differences
ncbirths$diff <- (ncbirths$fage) - (ncbirths$mage)

# test statistic
(mean(ncbirths$diff, na.rm = TRUE)-0) / (sd(ncbirths$diff, na.rm = TRUE)/sqrt(1000))

## [1] 19.41001

# probability of getting that test statistic
pt(19.41, df = 999, lower.tail = FALSE)*2

## [1] 1.840431e-71

Conclusion

Since the p-value (about 1.8e-71) is far smaller than any conventional alpha such as 0.05, we reject the null hypothesis in favor of the alternative hypothesis. In other words, the data provide strong evidence of a difference between the mean ages of fathers and mothers in the ncbirths dataset; the positive test statistic indicates that fathers are older on average.
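The use of na.rm = TRUE above implies that some ages are missing, in which case the number of complete pairs is smaller than the number of rows. The sketch below divides by the count of complete pairs and uses matching degrees of freedom. The helper name paired_t and the ages are made up for illustration; this is a variant of the calculation above, not the original computation.

```r
# Hypothetical helper: paired t-test using only complete pairs
paired_t <- function(a, b) {
  d <- a - b
  d <- d[!is.na(d)]                        # keep complete pairs only
  n <- length(d)                           # number of pairs actually used
  t_stat <- mean(d) / (sd(d) / sqrt(n))    # test statistic for H0: mean diff = 0
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)
  c(n = n, t = t_stat, p = p_value)
}

# Made-up ages for illustration only; with the project data the call
# would be paired_t(ncbirths$fage, ncbirths$mage)
fage <- c(31, NA, 28, 40, 35, 27, NA, 33)
mage <- c(29, 24, 27, 35, 36, 25, 30, 30)
res <- paired_t(fage, mage)
res

# Cross-check: t.test() with paired = TRUE also drops incomplete pairs
ref <- t.test(fage, mage, paired = TRUE)
ref$statistic
```

With a very large test statistic the decision is unlikely to change, but the pair count affects both the standard error and the degrees of freedom.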

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed

Test by confidence interval or p-value and decision

# subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

# mean difference in weeks
meandiffW <- mean(smokers$weeks, na.rm = TRUE) - mean(nonsmokers$weeks, na.rm = TRUE)

# sample sizes
table(is.na(smokers$weeks))

## 
## FALSE 
##   126

table(is.na(nonsmokers$weeks))

## 
## FALSE  TRUE 
##   872     1

# standard error of weeks
SEW <- sqrt((sd(smokers$weeks, na.rm = TRUE)^2/126)+(sd(nonsmokers$weeks, na.rm = TRUE)^2/872))

# test statistic
(meandiffW-0) / SEW

## [1] 0.5189962

# p-value
pt(-0.5189962, df = 125)*2

## [1] 0.604681

Conclusion

Since the p-value (0.60) is greater than alpha, we fail to reject the null hypothesis. In other words, the data do not provide sufficient evidence of a difference in mean length of pregnancy between smokers and nonsmokers in North Carolina.
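The calculation above uses the smaller sample size minus one as the degrees of freedom, which is a conservative choice. A rough sketch of that method next to R's t.test(), which uses the Welch approximation for the degrees of freedom instead, is shown below. The helper name two_sample_t and the data are hypothetical.

```r
# Hypothetical helper: two-sample t-test with conservative degrees of freedom
two_sample_t <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  se <- sqrt(var(x) / length(x) + var(y) / length(y))   # unpooled standard error
  t_stat <- (mean(x) - mean(y)) / se
  df_cons <- min(length(x), length(y)) - 1              # conservative df
  c(t = t_stat, df = df_cons, p = 2 * pt(-abs(t_stat), df = df_cons))
}

# Made-up pregnancy lengths in weeks, for illustration only; with the
# project data the call would be two_sample_t(smokers$weeks, nonsmokers$weeks)
smoker_weeks <- c(38, 40, 37, 39, 41, 36)
nonsmoker_weeks <- c(39, 40, NA, 38, 41, 39, 37, 40, 42)
res <- two_sample_t(smoker_weeks, nonsmoker_weeks)
res

# Welch test: same statistic, larger df, so a p-value that is no larger
ref <- t.test(smoker_weeks, nonsmoker_weeks)
ref$p.value
```

Both approaches share the same test statistic; they differ only in the reference distribution used for the p-value.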

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed

Test by confidence interval or p-value and decision

# subsets
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")

# Mean difference in weights
meandiffM <- mean(maturemom$weight) - mean(youngermom$weight)

# sample size
table(is.na(maturemom$weight))

## 
## FALSE 
##   133

table(is.na(youngermom$weight))

## 
## FALSE 
##   867

# Standard Error for Maturity
seM <- sqrt((sd(maturemom$weight)^2/133)+(sd(youngermom$weight)^2/867))

# test statistic
(meandiffM-0)/seM

## [1] 0.1858449

# p-value
pt(-0.1858449, df=132)*2

## [1] 0.8528517

Conclusion

Since the p-value (0.85) is greater than alpha = 0.05, we fail to reject the null hypothesis. In other words, the test above, which compares the weight column between the two groups, does not provide sufficient evidence of a difference between mature mothers and younger mothers.
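The question asks whether younger mothers gain more weight than mature mothers, which points to the gained column and a one-sided alternative. A minimal sketch of that variant is shown below. It uses R's built-in t.test() with Welch degrees of freedom rather than the hand calculation above, and the data frame births is a made-up stand-in for ncbirths.

```r
# Made-up stand-in for ncbirths, for illustration only
births <- data.frame(
  mature = c(rep("younger mom", 5), rep("mature mom", 4)),
  gained = c(32, 28, NA, 40, 35, 25, 30, 22, 27)
)

younger <- births$gained[births$mature == "younger mom"]
mature <- births$gained[births$mature == "mature mom"]

# One-sided test of H_A: mean(younger) - mean(mature) > 0
res <- t.test(younger, mature, alternative = "greater")
res$statistic
res$p.value

# Two-sided p-value for comparison
two_sided <- t.test(younger, mature)$p.value
two_sided
```

When the test statistic is positive, the one-sided p-value is half the two-sided one, so the choice of alternative can change the decision at the 0.05 level.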

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# make subsets
small <- subset(ncbirths, ncbirths$lowbirthweight == "low")
big <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

# look at the max and min values
summary(small$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500

summary(big$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

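To find the cutoff, I split the data by the lowbirthweight variable and compared the maximum weight among babies classified as “low” with the minimum weight among babies classified as “not low”.

The heaviest “low” baby weighs 5.50 pounds and the lightest “not low” baby weighs 5.56 pounds, so the cutoff lies between these two values. Because the two groups do not overlap, any weight of 5.50 pounds or less is “low” and any weight of 5.56 pounds or more is “not low” in this dataset.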
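The same comparison can be made without creating subsets by computing the range of weights within each category. This is a sketch on a made-up data frame named births; with the project data the columns would come from ncbirths.

```r
# Made-up stand-in for ncbirths, for illustration only
births <- data.frame(
  lowbirthweight = c("low", "low", "not low", "not low", "low", "not low"),
  weight = c(4.56, 5.50, 5.56, 7.44, 3.10, 8.13)
)

# Minimum and maximum weight within each category
ranges <- tapply(births$weight, births$lowbirthweight, range)
ranges

# The cutoff lies between the heaviest "low" and the lightest "not low" baby
max_low <- max(births$weight[births$lowbirthweight == "low"])
min_not_low <- min(births$weight[births$lowbirthweight == "not low"])
c(max_low = max_low, min_not_low = min_not_low)
```

If the largest "low" weight were above the smallest "not low" weight, the classification could not be explained by a single weight cutoff.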

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question

Determine if younger moms tend to visit the hospital more during pregnancy than mature moms.

Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\) two-tailed

Test by confidence interval or p-value and decision

# sample sizes
table(is.na(maturemom$visits))

## 
## FALSE  TRUE 
##   131     2

table(is.na(youngermom$visits))

## 
## FALSE  TRUE 
##   860     7

# standard error of visits
seV <- sqrt((sd(maturemom$visits, na.rm = TRUE)^2/131)+(sd(youngermom$visits, na.rm = TRUE)^2/860))

# mean difference in visits
meandiffV <- mean(maturemom$visits, na.rm = TRUE) - mean(youngermom$visits, na.rm = TRUE)

# test statistic
(meandiffV-0)/seV

## [1] 1.439373

# p-value
pt(-1.439, df=130)*2

## [1] 0.1525539

Conclusion

The p-value (0.15) is greater than alpha, so we fail to reject the null hypothesis. In other words, the data do not provide sufficient evidence that the mean number of hospital visits during pregnancy differs between younger moms and mature (older) moms.
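The same question can also be approached with a confidence interval for the difference in means. The sketch below assumes a 95% level and the conservative degrees of freedom used above. The helper name diff_conf_int and the visit counts are made up for illustration.

```r
# Hypothetical helper: t-interval for a difference in means (conservative df)
diff_conf_int <- function(x, y, level = 0.95) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  se <- sqrt(var(x) / length(x) + var(y) / length(y))   # unpooled standard error
  df_cons <- min(length(x), length(y)) - 1              # conservative df
  t_crit <- qt(1 - (1 - level) / 2, df = df_cons)
  d <- mean(x) - mean(y)
  c(lower = d - t_crit * se, upper = d + t_crit * se)
}

# Made-up visit counts for illustration only; with the project data the call
# would be diff_conf_int(maturemom$visits, youngermom$visits)
mature_visits <- c(12, 15, 10, NA, 14, 13)
younger_visits <- c(11, 12, 9, 13, 10, 12, NA, 11)
ci <- diff_conf_int(mature_visits, younger_visits)
ci

# Welch interval from t.test() for comparison (never wider than the one above)
ref <- t.test(mature_visits, younger_visits)$conf.int
ref
```

An interval that contains 0 leads to the same decision as a two-sided p-value above 0.05: the null hypothesis is not rejected.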

MAT143H - Introduction to Statistics Honors

Christina Pace

Due: Wednesday, April 4
