Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data with the t-distribution. Unless stated otherwise, assume a significance level of 0.05.

Instructions

Update the author line at the top to have your name in it.
You must knit this document to an html file and publish it to RPubs. Once you have published your project to the web, you must copy the url link into the appropriate Course Project assignment in MyOpenMath before 9:00am on the due date.
Answer all the following questions completely. Some may ask for written responses.
Use R chunks for code to be evaluated where needed and always comment all of your code so the reader can understand what your code aims to accomplish.
Proofread your knitted document before publishing it to ensure it looks the way you want it to. Use double spaces at the end of a line to create a line break, and make sure no text carries a header label it isn’t supposed to have.
Delete these instructions from your published project.
Questions #1 and #3 must follow the template laid out. You may choose to complete other questions similarly or by use of a valid shortcut.
Question #3 must be a hypothesis test by p-value. The remaining t-test questions may be done by your choice of either p-value or confidence interval.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded, just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Determine NAs in ncbirths$gained
table(is.na(ncbirths$gained))

## 
## FALSE  TRUE 
##   973    27

# Store mean, standard deviation, and sample size of dataset
mngain <- mean(ncbirths$gained, na.rm = TRUE)
sdgain <- sd(ncbirths$gained, na.rm = TRUE)
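lengthgain <- sum(!is.na(ncbirths$gained))  # 973 non-missing values
dfgain <- lengthgain - 1

# Print mean, standard deviation, sample size, and degrees of freedom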
mngain

## [1] 30.3258

sdgain

## [1] 14.2413

lengthgain

## [1] 973

dfgain

## [1] 972

# Calculate t-critical value for 90% confidence
tcrit <- abs(qt(0.05, dfgain))

The t-critical value is 1.65

# Calculate margin of error
megain <- tcrit*(sdgain/sqrt(lengthgain))

The Margin of error is 0.75

# Boundaries of confidence interval
mngain - megain

## [1] 29.57411

mngain + megain

## [1] 31.07748

We are 90% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.57 and 31.08 pounds.
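As a cross-check, the interval can be reproduced from the summary statistics alone. The sketch below assumes a hypothetical helper named `t_conf_int` (it is not part of the project template or of openintro) and plugs in the rounded mean, standard deviation, and sample size printed earlier.

```r
# Hypothetical helper: t-confidence interval from summary statistics
t_conf_int <- function(xbar, s, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  tcrit <- qt(1 - alpha / 2, df = n - 1)  # upper critical value
  me <- tcrit * s / sqrt(n)               # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Rounded values printed above for ncbirths$gained
ci90 <- t_conf_int(xbar = 30.3258, s = 14.2413, n = 973, conf_level = 0.90)
round(ci90, 2)
```

The result should agree with the bounds computed step by step. With the dataset loaded, `t.test(ncbirths$gained, conf.level = 0.90)$conf.int` is a built-in shortcut for the same interval.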

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

# Calculate t-critical value for 95% confidence
tcon <- abs(qt(.025, dfgain))

The t-critical value is 1.96

# Margin of Error
moe <- tcon*(sdgain/sqrt(lengthgain))

The margin of error is 0.90

# Boundaries of confidence interval
mngain - moe

## [1] 29.42985

mngain + moe

## [1] 31.22174

We are 95% confident that the average weight gained by North Carolina mothers during pregnancy is between 29.43 and 31.22 pounds.

How does that confidence interval compare to the one in Question #1?

# Width of the 90% confidence interval
31.07 - 29.57

## [1] 1.5

# Width of the 95% confidence interval
31.22 - 29.43

## [1] 1.79

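The 95% interval is wider than the 90% interval (about 1.79 pounds versus 1.50 pounds). The sample mean, standard deviation, and sample size are the same, but a higher confidence level requires a larger t-critical value (1.96 instead of 1.65), which increases the margin of error.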
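The comparison follows from the margin-of-error formula, in which only the critical value changes between the two intervals. The widths below are taken from the unrounded bounds printed in Questions 1 and 2.

```latex
\[
\text{ME} = t^{*}_{df} \cdot \frac{s}{\sqrt{n}}, \qquad \text{width} = 2\,\text{ME}
\]
\[
\text{90\%: } 31.07748 - 29.57411 = 1.50337, \qquad
\text{95\%: } 31.22174 - 29.42985 = 1.79189
\]
```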

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\(H_0\): \(\mu\) = 7.7
\(H_A\): \(\mu\) \(\neq\) 7.7
Test by p-value and decision

# Determine NAs in ncbirths$weight
table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

# Sample statistics (sample mean, standard deviation, and size)
mnbweight <- mean(ncbirths$weight)
sdbweight <- sd(ncbirths$weight)
lengthbweight <- length(ncbirths$weight)  # 1000, no missing values
dfbweight <- lengthbweight - 1

# Test statistic
teststat <- (mnbweight - 7.7) / (sdbweight/sqrt(lengthbweight))
teststat

## [1] -12.55388

# Two-sided p-value: probability of a test statistic at least this extreme under H_0
2 * pt(abs(teststat), dfbweight, lower.tail = FALSE)

## [1] 1.135415e-33

Conclusion
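We reject the null hypothesis in favor of the alternative hypothesis because the p-value (about 1.1e-33) is far below alpha (0.05). The data provide strong evidence that the average birth weight of North Carolina babies differs from the European average of 7.7 lbs; the negative test statistic indicates that the NC sample mean is lower.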
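The manual steps can be checked against R's built-in `t.test()`. The sketch below uses a small made-up vector of weights (an assumption for illustration, not values from ncbirths) so that it runs without the dataset, and compares the hand-computed two-sided p-value with the built-in one.

```r
# Made-up birth weights in pounds (illustration only, not ncbirths data)
x <- c(6.8, 7.4, 7.1, 8.0, 6.5, 7.7, 7.2, 6.9, 7.5, 7.0)
mu0 <- 7.7

# Manual calculation, following the template steps
n <- length(x)
tstat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
# abs() makes the two-sided p-value valid whatever the sign of the statistic
pval_manual <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)

# Built-in shortcut
res <- t.test(x, mu = mu0, alternative = "two.sided")

c(manual = pval_manual, builtin = res$p.value)
```

With the dataset loaded, the equivalent call would be `t.test(ncbirths$weight, mu = 7.7)`.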

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
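\(H_0\): \(\mu_{diff}\) = 0
\(H_A\): \(\mu_{diff}\) \(\neq\) 0
where \(\mu_{diff}\) is the mean of the paired differences (father's age minus mother's age).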
Test by confidence interval or p-value and decision

# Calculate the age difference for each pair (father minus mother) and count missing values
ncbirths$diff <- ncbirths$fage - ncbirths$mage
table(is.na(ncbirths$diff))

## 
## FALSE  TRUE 
##   829   171

mndiff <- mean(ncbirths$diff, na.rm = TRUE)
sddiff <- sd(ncbirths$diff, na.rm = TRUE)
lengthdiff <- sum(!is.na(ncbirths$diff))  # 829 complete pairs
dfdiff <- lengthdiff - 1

# Test stat
statdiff <- (mndiff-0) / (sddiff/sqrt(lengthdiff))
statdiff

## [1] 17.6727

# Two-sided p-value
pt(statdiff, dfdiff, lower.tail = FALSE)*2

## [1] 1.504608e-59

Conclusion
The p-value (about 1.5e-59) is smaller than alpha (0.05), so we reject the null hypothesis in favor of the alternative hypothesis. The data suggest a significant difference between the mean ages of mothers and fathers in the ncbirths dataset; because the differences were computed as father minus mother and the test statistic is positive, fathers are older on average.
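A paired t-test is a one-sample t-test on the within-pair differences, which is why only complete pairs count toward the sample size. The sketch below illustrates this with made-up ages (an assumption for illustration, not ncbirths values) and compares the manual result with `t.test(..., paired = TRUE)`.

```r
# Made-up ages (illustration only); NA marks a missing father's age
fage <- c(31, 28, NA, 35, 40, 26, 33, NA, 29, 38)
mage <- c(29, 27, 24, 30, 36, 27, 30, 22, 28, 33)

# Manual: one-sample t-test on the differences, dropping incomplete pairs
d <- fage - mage
d <- d[!is.na(d)]
n <- length(d)
tstat <- mean(d) / (sd(d) / sqrt(n))
pval_manual <- 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE)

# Built-in: paired = TRUE also keeps only the complete pairs
res <- t.test(fage, mage, paired = TRUE)

c(n_pairs = n, manual = pval_manual, builtin = res$p.value)
```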

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\(H_0\): \(\mu_{smoker}\) - \(\mu_{nonsmoker}\) = 0
\(H_A\): \(\mu_{smoker}\) - \(\mu_{nonsmoker}\) \(\neq\) 0
Test by confidence interval or p-value and decision

# Create subsets
smokers <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

# Count missing values, then store the mean, standard deviation, and sample size of pregnancy length for each smoking habit
table(is.na(smokers$weeks))

## 
## FALSE 
##   126

table(is.na(nonsmokers$weeks))

## 
## FALSE  TRUE 
##   872     1

mnsmokersW <- mean(smokers$weeks, na.rm = TRUE)
mnnonsmokersW <- mean(nonsmokers$weeks, na.rm = TRUE)
sdsmokersW <- sd(smokers$weeks, na.rm = TRUE)
sdnonsmokersW <- sd(nonsmokers$weeks, na.rm = TRUE)
lengthsmokersW <- sum(!is.na(smokers$weeks))        # 126
lengthnonsmokersW <- sum(!is.na(nonsmokers$weeks))  # 872
dfsmokersW <- lengthsmokersW - 1
dfnonsmokersW <- lengthnonsmokersW - 1

# mean of difference in weeks
mndiffW <- mnsmokersW - mnnonsmokersW
mndiffW

## [1] 0.1256371

# Standard Error of Weeks
seW <- sqrt((sdsmokersW^2 /lengthsmokersW) + (sdnonsmokersW^2 / lengthnonsmokersW))

# Test Stat
teststatW <- (mndiffW - 0) / seW
teststatW

## [1] 0.5189962

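# Two-sided p-value. This uses df = 871 (the nonsmoker group); the conservative
# choice, min(n1, n2) - 1 = 125, gives a slightly larger p-value and the same decision.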
pt(teststatW, dfnonsmokersW, lower.tail = FALSE)*2

## [1] 0.6038953

Conclusion
The p-value (about 0.60) is greater than alpha (0.05), so we cannot reject the null hypothesis. The data do not provide convincing evidence of a difference in the average length of pregnancy between smokers and nonsmokers in North Carolina.
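The choice of degrees of freedom affects the p-value of a two-sample test. The sketch below uses made-up pregnancy lengths (an assumption for illustration, not ncbirths values) to compare the conservative rule, the smaller sample size minus one, with the Welch approximation that `t.test()` applies by default. Both use the same unpooled test statistic.

```r
# Made-up pregnancy lengths in weeks (illustration only, not ncbirths data)
smoker    <- c(38, 39, 37, 40, 36, 39, 38, 41)
nonsmoker <- c(39, 40, 38, 41, 39, 37, 40, 38, 39, 42, 40, 39)

n1 <- length(smoker)
n2 <- length(nonsmoker)
se <- sqrt(var(smoker) / n1 + var(nonsmoker) / n2)
tstat <- (mean(smoker) - mean(nonsmoker)) / se

# Conservative degrees of freedom: smaller sample size minus one
df_cons <- min(n1, n2) - 1
p_cons <- 2 * pt(abs(tstat), df = df_cons, lower.tail = FALSE)

# Built-in Welch test (var.equal = FALSE is the default)
welch <- t.test(smoker, nonsmoker)

c(df_conservative = df_cons, df_welch = unname(welch$parameter),
  p_conservative = p_cons, p_welch = welch$p.value)
```

The Welch degrees of freedom are never smaller than the conservative value, so the conservative p-value is never smaller than the Welch one. With the dataset loaded, `t.test(weeks ~ habit, data = ncbirths)` would run the Welch version directly.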

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
\(H_0\): \(\mu_{younger}\) - \(\mu_{mature}\) = 0
\(H_A\): \(\mu_{younger}\) - \(\mu_{mature}\) > 0
Test by confidence interval or p-value and decision

# create subsets
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
dfymom <- 128

# Difference in mean weight gained (younger minus mature, matching the direction of H_A)
mndiffM <- mean(youngermom$gained, na.rm = TRUE) - mean(maturemom$gained, na.rm = TRUE)

# Sample Size
table(is.na(maturemom$gained))

## 
## FALSE  TRUE 
##   129     4

table(is.na(youngermom$gained))

## 
## FALSE  TRUE 
##   844    23

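# Standard Error
# Note: 133 and 867 are the subset sizes including NAs; the non-missing counts
# are 129 and 844, which would give a slightly larger SE and p-value (same decision).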
seM <- sqrt((sd(maturemom$gained, na.rm = TRUE)^2/133)+(sd(youngermom$gained, na.rm = TRUE)^2/867))

# Test Stat
teststatM <- (mndiffM-0)/seM

# p-value
pt(teststatM, dfymom, lower.tail=FALSE)

## [1] 0.08237283

Conclusion
Since the p-value (about 0.082) is greater than alpha (0.05), we cannot reject the null hypothesis. The data do not provide convincing evidence that younger mothers gain more weight on average than mature mothers.
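For a one-sided test the sign of the test statistic matters, so the difference should be taken in the direction stated by the alternative hypothesis rather than as an absolute value. The sketch below uses made-up weight gains (an assumption for illustration, not ncbirths values) to show how the upper-tail p-value changes when the groups are swapped.

```r
# Made-up weight gains in pounds (illustration only, not ncbirths data)
younger <- c(32, 28, 35, 30, 41, 27, 33, 36, 29, 31)
mature  <- c(29, 26, 31, 34, 25, 30, 28)

# Signed difference in the direction of H_A (younger minus mature)
diff_means <- mean(younger) - mean(mature)
se <- sqrt(var(younger) / length(younger) + var(mature) / length(mature))
tstat <- diff_means / se

# Upper-tail p-value with the conservative degrees of freedom
df_cons <- min(length(younger), length(mature)) - 1
p_upper <- pt(tstat, df = df_cons, lower.tail = FALSE)

# Swapping the groups flips the sign and gives the complementary tail,
# a distinction that abs() would hide
p_swapped <- pt(-tstat, df = df_cons, lower.tail = FALSE)

c(t = tstat, p_upper = p_upper, p_swapped = p_swapped)
```

The built-in equivalent is `t.test(younger, mature, alternative = "greater")`, which computes the same statistic but uses the Welch degrees of freedom.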

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

# create subsets
smallBaby <- subset(ncbirths, ncbirths$lowbirthweight == "low")
bigBaby <- subset(ncbirths, ncbirths$lowbirthweight == "not low")

# summary of weights of babies
summary(smallBaby$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.095   4.560   4.035   5.160   5.500

summary(bigBaby$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.560   6.750   7.440   7.484   8.130  11.750

table(ncbirths$lowbirthweight)

## 
##     low not low 
##     111     889

table(smallBaby$premie)

## 
## full term    premie 
##        30        80

table(bigBaby$premie)

## 
## full term    premie 
##       816        72

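I created two subsets of the data, one for babies classified as “low” birth weight and one for “not low”, and used the summary function to find the minimum and maximum weight in each. The heaviest “low” baby weighs 5.50 pounds and the lightest “not low” baby weighs 5.56 pounds, so the cutoff falls between those two values. No weights are observed in that gap, so the data alone cannot pin the cutoff down further; it is consistent with the standard clinical threshold of 2,500 grams (about 5.51 pounds).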
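The same reasoning can be written compactly: find the largest weight in one category and the smallest in the other, then look at the gap between them. The sketch below uses a made-up data frame named `babies` (an assumption for illustration, not ncbirths itself) with the same two column names.

```r
# Made-up records (illustration only); weight in pounds
babies <- data.frame(
  weight = c(4.2, 5.5, 3.1, 5.56, 7.4, 6.8, 8.1, 5.0),
  lowbirthweight = c("low", "low", "low", "not low", "not low",
                     "not low", "not low", "low")
)

# Heaviest "low" baby and lightest "not low" baby
heaviest_low <- with(babies, max(weight[lowbirthweight == "low"]))
lightest_not_low <- with(babies, min(weight[lowbirthweight == "not low"]))

# The cutoff must lie in this gap
c(heaviest_low = heaviest_low, lightest_not_low = lightest_not_low)

# 2,500 grams expressed in pounds (1 lb = 453.59237 g)
grams_cutoff_lb <- 2500 / 453.59237
grams_cutoff_lb
```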

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Is there a significant difference in the length of pregnancy between younger mothers and mature mothers?
Write hypotheses
\(H_0\): \(\mu_{mature}\) - \(\mu_{younger}\) = 0
\(H_A\): \(\mu_{mature}\) - \(\mu_{younger}\) \(\neq\) 0
Test by confidence interval or p-value and decision

# Create subsets and store mean and standard deviation 
maturemom <- subset(ncbirths, ncbirths$mature == "mature mom")
youngermom <- subset(ncbirths, ncbirths$mature == "younger mom")
mnweeksYM <- mean(youngermom$weeks, na.rm = TRUE)
mnweeksMM <- mean(maturemom$weeks, na.rm = TRUE)
sdweeksYM <- sd(youngermom$weeks, na.rm = TRUE)
sdweeksMM <- sd(maturemom$weeks, na.rm = TRUE)
mndiffweeks <- (mnweeksMM - mnweeksYM)

# Determine if any NAs
table(is.na(maturemom$weeks))

## 
## FALSE  TRUE 
##   132     1

table(is.na(youngermom$weeks))

## 
## FALSE  TRUE 
##   866     1

# Standard error
seMOMweeks <- sqrt((sdweeksMM^2/132)+(sdweeksYM^2/866))
seMOMweeks

## [1] 0.2967804

# Test Stat
weeksSTAT <- abs((mndiffweeks- 0) / seMOMweeks)
weeksSTAT

## [1] 1.211299

# Find p-value
pt(weeksSTAT, df = 131, lower.tail = FALSE)*2

## [1] 0.2279614

Conclusion

The p-value is 0.228, which is greater than alpha (0.05), so we fail to reject the null hypothesis. The data do not provide convincing evidence of a difference in average pregnancy length between younger and mature mothers.