Project #5 - Inference on Numerical Data

Purpose

In this project, students will demonstrate their understanding of inference on numerical data using the t-distribution. Unless otherwise stated, assume a significance level of 0.05.

Preparation

Store the ncbirths dataset in your environment in the following R chunk. Do some exploratory analysis by using the str() function, viewing the dataframe, and reading its documentation to familiarize yourself with all the variables. None of this will be graded; it is just something for you to do on your own.

# Load Openintro Library
library(openintro)

# Store ncbirths in environment
ncbirths <- ncbirths

Question 1 - Single Sample t-confidence interval

Construct a 90% t-confidence interval for the average weight gained for North Carolina mothers and interpret it in context.

# Store mean, standard deviation, and sample size of dataset
mean.gained <- mean(ncbirths$gained, na.rm = TRUE)
sd.gained <- sd(ncbirths$gained, na.rm = TRUE)
table(is.na(ncbirths$gained))

## 
## FALSE  TRUE 
##   973    27

sample.size <- 973

# Calculate t-critical value for 90% confidence
abs(qt(.05, df = 972))

## [1] 1.646423

# Calculate margin of error
1.646423 * 14.24129662/sqrt(973)

## [1] 0.7516827

# Boundaries of confidence interval
#Lower bound
30.3258 - 0.7516827

## [1] 29.57412

#Upper Bound 
30.3258 + 0.7516827

## [1] 31.07748

We are 90% confident that the average weight gained by North Carolina mothers is between 29.57 pounds and 31.08 pounds.
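The steps above can be wrapped in a small helper. This is a minimal sketch rather than part of the assignment: the function name `t_ci` and its arguments are illustrative, and the inputs are the sample statistics reported above.

```r
# Hypothetical helper: t-confidence interval from summary statistics.
t_ci <- function(xbar, s, n, conf = 0.90) {
  alpha  <- 1 - conf
  t.crit <- qt(1 - alpha / 2, df = n - 1)  # two-sided critical value
  me     <- t.crit * s / sqrt(n)           # margin of error
  c(lower = xbar - me, upper = xbar + me)
}

# Sample statistics reported above for ncbirths$gained
ci.90 <- t_ci(xbar = 30.3258, s = 14.24129662, n = 973, conf = 0.90)
print(round(ci.90, 2))
```

Passing a different `conf` value reuses the same sample statistics for any other confidence level.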

Question 2 - Single Sample t-confidence interval

Construct a new confidence interval for the same parameter as Question 1, but at the 95% confidence level.

#Calculate t-critical value for 95% confidence
abs(qt(.025, df = 972))

## [1] 1.962408

#Boundaries of confidence interval
#Lower bound
30.3258 - 1.962408*14.24129662/sqrt(973)

## [1] 29.42985

#Upper bound
30.3258 + 1.962408*14.24129662/sqrt(973)

## [1] 31.22175

We are 95% confident that the average weight gained by North Carolina mothers is between 29.43 pounds and 31.22 pounds.

How does that confidence interval compare to the one in Question #1? The 95% interval (29.43 to 31.22) is wider than the 90% interval (29.57 to 31.08), by a little over a tenth of a pound at each end. Both are centered on the same sample mean; the higher confidence level requires a larger critical value (1.962 versus 1.646), which increases the margin of error.
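To see how the interval width depends on the confidence level, the calculation can be repeated for several levels. This is a sketch using the sample statistics from Question 1; the helper name `ci_width` is illustrative, and the 99% level is included only for comparison (it is not part of the assignment).

```r
# Sample statistics reported in Question 1
xbar <- 30.3258
s    <- 14.24129662
n    <- 973

# Width (upper bound minus lower bound) of the t-interval at a given level
ci_width <- function(conf) {
  2 * qt(1 - (1 - conf) / 2, df = n - 1) * s / sqrt(n)
}

widths <- sapply(c(0.90, 0.95, 0.99), ci_width)
names(widths) <- c("90%", "95%", "99%")
print(round(widths, 3))
```

The width grows with the confidence level because only the critical value changes; the standard error stays the same.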

Question 3 - Single Sample t-test

The average birthweight of European babies is 7.7 lbs. Conduct a hypothesis test by p-value to determine if the average birthweight of NC babies is different from that of European babies.

Write hypotheses
\(H_0: \mu = 7.7\) \(H_A: \mu \neq 7.7\)
Test by p-value and decision

# Sample statistics (sample mean, standard deviation, and size)
mean.weight <- mean(ncbirths$weight, na.rm = TRUE)
sd.weight <- sd(ncbirths$weight, na.rm = TRUE)
table(is.na(ncbirths$weight))

## 
## FALSE 
##  1000

# Test statistic
(7.101 - 7.7)/(1.50886/sqrt(1000))

## [1] -12.55388

The t-score is -12.55388

# Probability of test statistic by chance
pt(-12.55388, df = 999)*2

## [1] 1.135354e-33

Decision: The p-value is less than 0.05, so we reject H0 (the null hypothesis).

Conclusion
The data suggest that the average birthweight of NC babies differs from the 7.7 lb average birthweight of European babies.
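The test above can be collected into one function that works from summary statistics. This is a sketch under the assumption that only the rounded sample mean, standard deviation, and size used above are available; the name `t_test_summary` is hypothetical.

```r
# Hypothetical helper: two-sided one-sample t-test from summary statistics.
t_test_summary <- function(xbar, s, n, mu0) {
  t.stat  <- (xbar - mu0) / (s / sqrt(n))      # test statistic
  p.value <- 2 * pt(-abs(t.stat), df = n - 1)  # area in both tails
  list(statistic = t.stat, df = n - 1, p.value = p.value)
}

res <- t_test_summary(xbar = 7.101, s = 1.50886, n = 1000, mu0 = 7.7)
print(res$statistic)
print(res$p.value < 0.05)
```

With the raw data loaded, `t.test(ncbirths$weight, mu = 7.7)` should give the same test in a single call, up to rounding of the inputs.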

Question 4 - Paired Data t-test

In the ncbirths dataset, test if there is a significant difference between the mean age of mothers and fathers.

Write hypotheses
\(H_0: \mu_{diff} = 0\) \(H_A: \mu_{diff} \neq 0\)
Test by confidence interval or p-value and decision

#Create a column of differences  
ncbirths$diff <- ncbirths$fage - ncbirths$mage

#Sample statistics (sample mean, standard deviation, and size)
mean(ncbirths$diff, na.rm = TRUE)

## [1] 2.652593

sd(ncbirths$diff, na.rm = TRUE)

## [1] 4.321604

table(is.na(ncbirths$diff))

## 
## FALSE  TRUE 
##   829   171

# Calculating the t-score
(mean(ncbirths$diff,na.rm = TRUE)-0)/(sd(ncbirths$diff,na.rm = TRUE)/sqrt(829))

## [1] 17.6727

#Probability of getting the test statistic
pt(17.6727, df = 828, lower.tail = FALSE)*2

## [1] 1.504649e-59

Decision: The p-value is less than 0.05, so we reject the null hypothesis.

Conclusion
The data suggest there is a difference between the mean age of the mothers and fathers.
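Since the question allows a test by confidence interval, the same decision can be checked that way. The following is a sketch using the summary statistics of the differences printed above; the variable names are illustrative.

```r
# Summary statistics of the paired differences reported above (fage - mage)
d.bar <- 2.652593
d.sd  <- 4.321604
d.n   <- 829

se.d   <- d.sd / sqrt(d.n)         # standard error of the mean difference
t.stat <- (d.bar - 0) / se.d       # test statistic under H0: mean difference = 0
t.crit <- qt(0.975, df = d.n - 1)  # critical value for a 95% interval
ci.95  <- c(lower = d.bar - t.crit * se.d, upper = d.bar + t.crit * se.d)

print(t.stat)
print(ci.95)
```

If the interval does not contain 0, the decision matches the p-value approach at the 0.05 level.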

Question 5 - Two Independent Sample t-test

In the ncbirths dataset, test if there is a significant difference in length of pregnancy between smokers and non-smokers.

Write hypotheses
\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\)
Test by confidence interval or p-value and decision

#This stores smokers and nonsmokers subsets
smoker <- subset(ncbirths, ncbirths$habit == "smoker")
nonsmokers <- subset(ncbirths, ncbirths$habit == "nonsmoker")

#Finds the mean for the smokers and non smokers
mean.smoker <- mean(smoker$weeks, na.rm = TRUE)
mean.nonsmokers <- mean(nonsmokers$weeks, na.rm = TRUE)

#finds the standard deviation for smokers and non smokers. 
sd.smoker <- sd(smoker$weeks, na.rm = TRUE)
sd.nonsmokers <- sd(nonsmokers$weeks, na.rm = TRUE)
summary(ncbirths$habit)

## nonsmoker    smoker      NA's 
##       873       126         1

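#Finds the standard error
#Note: 126 and 873 are the group counts from the habit table above;
#they assume no missing weeks values within either group
SE <- sqrt((sd.smoker^2/126)+(sd.nonsmokers^2/873))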

#Finds test statistic 
(((mean.smoker)-(mean.nonsmokers))-0)/SE

## [1] 0.5190483

#Probability of getting the test statistic
pt(0.5190483, df = 125, lower.tail = FALSE)*2

## [1] 0.6046448

Decision: The p-value is above 0.05 so we fail to reject the null hypothesis.

Conclusion
The data do not provide sufficient evidence of a significant difference in length of pregnancy between smokers and non-smokers.
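The test above uses the conservative degrees of freedom, min(n1, n2) - 1 = 125. An alternative is the Welch-Satterthwaite approximation, sketched below. The group sizes come from the table above, but the two standard deviations are placeholders chosen only to make the example run, because the report does not print them; `welch_df` is a hypothetical helper name.

```r
# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1
  v2 <- s2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

n.smoker    <- 126  # group sizes from the habit table above
n.nonsmoker <- 873
s.smoker    <- 3    # PLACEHOLDER value, not taken from the data
s.nonsmoker <- 3    # PLACEHOLDER value, not taken from the data

df.conservative <- min(n.smoker, n.nonsmoker) - 1
df.welch <- welch_df(s.smoker, n.smoker, s.nonsmoker, n.nonsmoker)
print(c(conservative = df.conservative, welch = df.welch))
```

With the raw data, `t.test(weeks ~ habit, data = ncbirths)` performs the Welch test by default; the conservative choice used above gives a p-value that is at least as large.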

Question 6

Conduct a hypothesis test at the 0.05 significance level evaluating whether the average weight gained by younger mothers is more than the average weight gained by mature mothers.

Write hypotheses
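\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 > 0\)
Here \(\mu_1\) is the mean weight gained by younger mothers and \(\mu_2\) is the mean weight gained by mature mothers.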
Test by confidence interval or p-value and decision

#Young mother and mature mothers subset.
y.mothers <- subset(ncbirths, ncbirths$mature == "younger mom")
m.mothers <- subset(ncbirths, ncbirths$mature == "mature mom")

#Sample statistics (sample mean, standard deviation, and size)
mean.y.mothers <- mean(y.mothers$gained, na.rm = TRUE)
mean.m.mothers <- mean(m.mothers$gained, na.rm = TRUE)
sd.y.mothers <- sd(y.mothers$gained, na.rm = TRUE)
sd.m.mothers <- sd(m.mothers$gained, na.rm = TRUE)
table(is.na(y.mothers$gained))

## 
## FALSE  TRUE 
##   844    23

table(is.na(m.mothers$gained))

## 
## FALSE  TRUE 
##   129     4

#Difference in sample means (younger mothers - mature mothers)
agediff.gained <- (mean.y.mothers-mean.m.mothers)

#Finds the standard error 
SE.age.diff <- sqrt((sd.y.mothers^2/844)+(sd.m.mothers^2/129))

#test statistic
(1.7697 - 0)/1.2857

## [1] 1.376449

(agediff.gained-0)/SE.age.diff

## [1] 1.376483

#P-value
pt(1.376486, df = 128, lower.tail = FALSE)

## [1] 0.08553716

Decision: The p-value (0.086) is above 0.05, so we fail to reject H0.

Conclusion
The data do not provide sufficient evidence that the average weight gained by younger mothers is more than the average weight gained by mature mothers.
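Because the question asks whether younger mothers gain more, the p-value above is taken from the upper tail only. The sketch below contrasts this with the two-sided p-value for the same test statistic; the variable names are illustrative.

```r
t.stat   <- 1.376483  # test statistic computed above
deg.free <- 128       # min(844, 129) - 1

# One-sided: area above the test statistic (HA: mu1 - mu2 > 0)
p.one.sided <- pt(t.stat, df = deg.free, lower.tail = FALSE)

# Two-sided: area in both tails (HA: mu1 - mu2 != 0)
p.two.sided <- 2 * pt(-abs(t.stat), df = deg.free)

print(c(one.sided = p.one.sided, two.sided = p.two.sided))
```

The two-sided value is exactly twice the one-sided value here, so the decision at the 0.05 level is the same under either alternative.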

Question 7

Determine the weight cutoff for babies being classified as “low” or “not low”. Use a method of your choice, describe how you accomplished your answer, and explain as needed why you believe your answer is accurate. (This is a non-inference task)

#table of low birthweight variable
table(ncbirths$lowbirthweight)

## 
##     low not low 
##     111     889

#Subsets of low and not low birthweight
low<-subset(ncbirths,ncbirths$lowbirthweight=="low")
notlow<-subset(ncbirths,ncbirths$lowbirthweight=="not low")

#Fivenumber summary
fivenum(low$weight)

## [1] 1.000 3.095 4.560 5.160 5.500

fivenum(notlow$weight)

## [1]  5.56  6.75  7.44  8.13 11.75

The weight cutoff for babies being classified as low is 5.50 lbs. To find it, I created subsets of the low and not low birthweight groups and computed the five-number summary for each, which gives the minimum, first quartile, median, third quartile, and maximum of each subset. The maximum weight in the “low” group is 5.50 lbs and the minimum weight in the “not low” group is 5.56 lbs. I believe this is accurate because, looking at the table of data, every baby weighing 5.50 lbs or less is classified as low. It is possible that the true cutoff lies between 5.50 and 5.56 lbs and that there is not enough data to determine it more precisely.
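The same idea can be expressed as a helper that brackets the cutoff directly. This is a sketch: `cutoff_bounds` is a hypothetical name, and the small data frame is built from the five-number summaries above purely for illustration, not from the full dataset.

```r
# Hypothetical helper: largest "low" weight and smallest "not low" weight
cutoff_bounds <- function(weight, class) {
  c(max.low     = max(weight[class == "low"], na.rm = TRUE),
    min.not.low = min(weight[class == "not low"], na.rm = TRUE))
}

# Illustrative data frame built from the five-number summaries above
toy <- data.frame(
  weight = c(1.000, 3.095, 4.560, 5.160, 5.500, 5.56, 6.75, 7.44, 8.13, 11.75),
  lowbirthweight = rep(c("low", "not low"), each = 5)
)

bounds <- cutoff_bounds(toy$weight, toy$lowbirthweight)
print(bounds)
```

On the real data the call would be `cutoff_bounds(ncbirths$weight, ncbirths$lowbirthweight)`; the true cutoff lies somewhere between the two returned values.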

Question 8

Pick a pair of numerical and categorical variables from the ncbirths dataset and come up with a research question evaluating the relationship between these variables. Formulate the question in a way that it can be answered using a hypothesis test and/or a confidence interval.

Question
Is there a difference in length of pregnancy between younger mothers and mature mothers?
Write hypotheses

\(H_0: \mu_1 - \mu_2 = 0\) \(H_A: \mu_1 - \mu_2 \neq 0\)

Test by confidence interval or p-value and decision

#Mature and younger mom subsets
mature.mom <- subset(ncbirths, ncbirths$mature == "mature mom")
younger.mom <-subset(ncbirths, ncbirths$mature == "younger mom")

#Mean and standard deviation of sample
mean.m.length <- mean(mature.mom$weeks, na.rm = TRUE)
mean.y.length <- mean(younger.mom$weeks, na.rm = TRUE)

sd.mature.length <- sd(mature.mom$weeks, na.rm = TRUE)
sd.younger.length <- sd(younger.mom$weeks, na.rm = TRUE)

#Difference in mean length of pregnancy (younger mom - mature mom)
mean.diff.length <- (mean.y.length - mean.m.length)

#Sample size
table(is.na(mature.mom$weeks))

## 
## FALSE  TRUE 
##   132     1

table(is.na(younger.mom$weeks))

## 
## FALSE  TRUE 
##   866     1

#Standard error
SE.age.length <- sqrt((sd.mature.length^2/132) + (sd.younger.length^2/866))

#Test Statistic
(mean.diff.length - 0)/SE.age.length

## [1] 1.211299

(0.35948981 - 0)/0.29678

## [1] 1.211301

#Probability of getting the test statistic
pt(1.211299, df = 131, lower.tail = FALSE)*2

## [1] 0.2279614

Decision: The p-value is 0.228, which is above 0.05, so we fail to reject H0.

Conclusion
The data do not provide sufficient evidence of a difference in the length of pregnancy between mature moms and younger moms.
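The decision can also be checked with a 95% confidence interval for the difference in means. This is a sketch using the difference and standard error from the calculation above, with the same conservative degrees of freedom; the variable names are illustrative.

```r
diff.weeks <- 0.35948981  # difference in mean weeks (younger - mature), from above
se.weeks   <- 0.29678     # standard error, from above
deg.free   <- 131         # min(132, 866) - 1

t.crit <- qt(0.975, df = deg.free)  # critical value for a 95% interval
ci.95  <- c(lower = diff.weeks - t.crit * se.weeks,
            upper = diff.weeks + t.crit * se.weeks)

print(ci.95)
# TRUE when 0 lies inside the interval
print(ci.95[["lower"]] < 0 && ci.95[["upper"]] > 0)
```

An interval that contains 0 is consistent with failing to reject H0 at the 0.05 level.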